Go Big (with Data Lake Architecture) or Go Home | Reference material for the Data Governance and Visualization course | Trường Đại học Bách Khoa Hà Nội (Hanoi University of Science and Technology)

Objective:
Understand how the traditional data landscape is changing,
what opportunities big data presents and what architectures
allow you to maximize the benefits to your organization.

Information: 35 pages · uploaded 3 months ago


BR008
Microsoft Machine Learning & Data Science Summit
September 26–27 | Atlanta, GA
Go Big (with Data Lake Architecture) or Go Home!
Omid Afnan
Session objectives and key takeaways
Objective:
Understand how the traditional data landscape is changing,
what opportunities big data presents and what architectures
allow you to maximize the benefits to your organization.
Key Takeaways:
Data lake architectures can be additive to your data
warehouse
Azure Data Lake makes building big data environments easy
The traditional data warehouse
“… data warehousing has reached the most significant tipping point since its inception. The biggest, possibly most elaborate data management system in IT is changing.”
– Gartner, “The State of Data Warehousing”*
[Diagram: Data sources (OLTP, ERP, CRM, LOB) → ETL → Data warehouse → BI and analytics (Dashboards, Reporting)]
* Donald Feinberg, Mark Beyer, Merv Adrian, Roxane Edjlali (Gartner), The State of Data Warehousing in 2012 (Stamford, CT.: Gartner, 2012)
Big Data is driving transformative changes
“Big data is high-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation.”
Gartner, Big Data Definition*
* Gartner, Big Data (Stamford, CT.: Gartner, 2016), URL: http://www.gartner.com/it-glossary/big-data/
Big Data is driving transformative changes along three dimensions: data characteristics, cost, and culture.
Traditional vs. Big Data:
- Data characteristics: Relational (with highly modeled schema) → All data (with schema agility)
- Cost: Expensive (storage and compute capacity) → Commodity (storage and compute capacity)
- Culture: Rear-view reporting (using relational algebra) → Intelligent action (using relational algebra AND ML, graph, streaming, image processing)
Example: Culture of experimentation
Tangerine instantly adapts to customer feedback to offer customers what they want, when they want it.
Scenario:
- Lack of insight for targeted campaigns
- Inability to support data growth
Solution:
Azure HDInsight (Hadoop-as-a-service) with the Analytics Platform System enables instant analysis of social sentiment and customer feedback across digital, face-to-face, and phone interactions.
Result:
- Reduced time to customer insight
- Ability to make changes to campaigns or adjust product rollouts based on real-time customer reactions
- Ability to offer incentives and new services to retain, and grow, its customer base
“I can see us…creating predictive, context-aware financial services applications that give information based on the time and where the customer is.”
– Billy Lo, Head of Enterprise Architecture
Why Data Lakes?
Traditional business analytics process
[Diagram: LOB Applications → dedicated ETL tools (e.g. SSIS) → ETL pipeline → relational store with defined schema → queries → results]
[Cycle: New requirements → Identify data schema and queries → Identify data sources → Create ETL pipeline → Do analytics → Create reports]
1. Start with end-user requirements to identify desired reports and analysis
2. Define corresponding database schema and queries
3. Identify the required data sources
4. Create an Extract-Transform-Load (ETL) pipeline to extract required data (curation) and transform it to the target schema (‘schema-on-write’)
5. Create reports, analyze data
All data not immediately required is discarded or archived.
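The schema-on-write step above can be sketched in a few lines: the target schema is fixed before any data is loaded, and records that do not fit are discarded at write time. The table name and fields here are hypothetical, not from the deck.

```python
import sqlite3

# Target schema is fixed up front (schema-on-write): anything that
# does not fit is transformed or discarded before loading.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (order_id INTEGER, region TEXT, amount REAL)")

def etl(rows):
    """Extract raw rows, transform to the target schema, and load them."""
    loaded = 0
    for row in rows:
        try:
            record = (int(row["order_id"]),
                      row["region"].strip().upper(),
                      float(row["amount"]))
        except (KeyError, ValueError):
            continue  # data that does not match the schema is discarded
        conn.execute("INSERT INTO sales VALUES (?, ?, ?)", record)
        loaded += 1
    return loaded

raw = [
    {"order_id": "1", "region": "emea", "amount": "19.99"},
    {"order_id": "2", "region": "amer", "amount": "5.00"},
    {"order_id": "x", "region": "apac", "amount": "oops"},  # discarded
]
n = etl(raw)
total = conn.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
```

Note how the third record is lost forever, which is exactly the limitation the data lake approach addresses.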
New big data thinking: All data has value
- All data has potential value
- Data hoarding
- No defined schema: stored in native format
- Schema is imposed and transformations are done at query time (schema-on-read); apps and users interpret the data as they see fit
[Cycle: Gather data from all sources → Store indefinitely → Analyze → See results → Iterate]
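By contrast, schema-on-read keeps raw records in their native format and imposes structure only at query time. A minimal sketch, assuming JSON-line events with hypothetical field names:

```python
import json

# Raw events are stored as-is, in their native (here JSON-line) format.
raw_store = [
    '{"user": "a", "action": "buy", "amount": 12.5}',
    '{"user": "b", "action": "view"}',
    '{"user": "a", "action": "buy", "amount": 7.5}',
]

def query_total_spend(user):
    """Impose a schema at read time: interpret only the fields this query needs."""
    total = 0.0
    for line in raw_store:
        event = json.loads(line)
        if event.get("user") == user and event.get("action") == "buy":
            total += event.get("amount", 0.0)
    return total
```

A different query could interpret the very same records with a different schema, which is what lets apps and users read the data as they see fit.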
The data lake and warehouse
[Diagram: Devices, sensors, video, web, social, and clickstream data land in the data lake, which supports interactive queries, batch queries, machine learning, and real-time analytics. Cooked data and metadata joins flow between the lake and the relational warehouse, which is fed by LOB applications through an ETL pipeline with a defined schema and serves queries, dashboards, reports, and exploration.]
However, Big Data is not easy…
The top adoption challenges*: obtaining skills and capabilities, determining how to get value, and integrating with existing IT investments.
* Gartner, Survey Analysis: Hadoop Adoption Drivers and Challenges (Stamford, CT.: Gartner, 2015)
Microsoft made the transition to Big Data
We wanted to build better products based on real usage and experimentation, so we built:
- A data lake for everyone to put their data
- Tools approachable by any developer
- Machine learning tools for collaborating across large experiment models
Result:
- Used at Microsoft across Office, Xbox Live, Azure, Windows, Bing and Skype
- 10K+ developers running diverse workloads and scenarios
- Exabytes of data under management
[Chart: Data Stored over time, across Windows, SMSG, Live, Bing, CRM/Dynamics, Xbox Live, Office365, Malware Protection, Microsoft Stores, Commerce Risk, Skype, LCA, Exchange, and Yammer]
Patterns for Big Data
Big Data Analytics: Data Flow
[Diagram: DATA → INTELLIGENCE → ACTION. Business apps, custom apps, and sensors and devices feed ingestion (bulk ingestion and event ingestion) into Azure Data Lake Store; preparation, analytics, and machine learning run over the store; Azure Data Catalog provides discovery, and Power BI provides visualization for people.]
Event ingestion patterns
[Diagram: Events from business apps, custom apps, and sensors and devices flow into event collection (Azure Event Hubs, Kafka) and on to stream processing (Azure Stream Analytics, Spark Streaming). Raw events land in Azure Data Lake Store; transformed data feeds real-time dashboards in Power BI.]
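The collection-then-processing split above can be mimicked in miniature: a buffered list stands in for Event Hubs/Kafka, and a tumbling-window aggregate stands in for a Stream Analytics or Spark Streaming job. The device and temperature fields are hypothetical.

```python
from collections import defaultdict

# Event collection: a buffered queue stands in for Event Hubs / Kafka.
event_queue = [
    {"device": "d1", "ts": 0,  "temp": 20.0},
    {"device": "d1", "ts": 30, "temp": 22.0},
    {"device": "d2", "ts": 45, "temp": 30.0},
    {"device": "d1", "ts": 70, "temp": 24.0},
]

def tumbling_window_avg(events, window_seconds=60):
    """Stream processing: average temperature per device per tumbling window."""
    sums = defaultdict(lambda: [0.0, 0])  # (device, window) -> [sum, count]
    for e in events:
        key = (e["device"], e["ts"] // window_seconds)
        sums[key][0] += e["temp"]
        sums[key][1] += 1
    return {k: s / n for k, (s, n) in sums.items()}

averages = tumbling_window_avg(event_queue)
```

In the real pattern the transformed aggregates would be written to the dashboard sink while the raw events land untouched in the data lake store.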
Lambda architecture
[Diagram: INGEST → PREPARE → ANALYZE → PUBLISH → CONSUME. Data sources (web/LOB applications, sensors (IoT, devices, mobile), logs (CSV, JSON, XML), and reference data) are ingested through Event Hubs with ASA job rules. The hot path performs real-time scoring with Machine Learning and feeds Power BI dashboards and Cortana. The cold path archives raw data in Data Lake Store, where Data Lake Analytics performs flatten and metadata joins, hourly/daily/monthly roll-ups, offline training, and batch scoring, publishing aggregated data to Azure SQL Data Warehouse. Data Factory moves data and orchestrates, schedules, and monitors the whole pipeline.]
Leading Computer Manufacturer / Retailer
How They Did It: Analyzing Clickstream to Provide Real-time Recommendations Online
How They Did It
Collect clickstream data
In tab separated text files
Adding 22 new files per hour ~5-18
MB/file
Currently 1TB and growing
Spin up Hadoop
Use Hive scripts because of SQL-like
syntax
Extracts click behavior like buys,
additions to carts, reviews etc. and
assigns scores
Jobs run hourly
Currently 8-nodes with plans to 16
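The hourly Hive job described above extracts click behavior and assigns scores. A Python sketch of the same scoring idea; the weights and event names are hypothetical, not from the case study:

```python
# Hypothetical weights: a purchase signals more intent than a cart add or review.
SCORE_WEIGHTS = {"buy": 5, "add_to_cart": 3, "review": 2, "view": 1}

def score_clickstream(events):
    """Aggregate a per-(user, product) interest score from raw click events."""
    scores = {}
    for e in events:
        key = (e["user"], e["product"])
        scores[key] = scores.get(key, 0) + SCORE_WEIGHTS.get(e["action"], 0)
    return scores

events = [
    {"user": "u1", "product": "laptop", "action": "view"},
    {"user": "u1", "product": "laptop", "action": "add_to_cart"},
    {"user": "u1", "product": "laptop", "action": "buy"},
    {"user": "u2", "product": "mouse",  "action": "review"},
]
scores = score_clickstream(events)
```

In the actual pipeline this aggregation runs as an hourly Hive query over the tab-separated log files, with the scored output feeding the recommendation service.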
[Diagram: Azure services for targeted email and recommendations. Web logs and Omniture user-segment data from Website.com are captured into Blob Storage; an HDInsight cluster processes clickstream and recommendation data, with AzureML producing training/validation and scored data. Scored data and template data drive targeted email sent from an email server (IaaS VM) to users, whose clicks feed web logs back into the pipeline. For recommendations, an Event Hub feeds near-real-time (NRT) AzureML scoring; deterministic and non-deterministic recommendations and cached data are held in Azure SQL DB, NoSQL storage, and persisted storage, and a visitor information service joins the product catalog with click-feedback data.]
Making Big Data Easy
