Data Lake Management: Challenges and Opportunities
Fatemeh Nargesian
University of Toronto
fnargesian@cs.toronto.edu
Erkang Zhu
University of Toronto
ekzhu@cs.toronto.edu
Renée J. Miller
Northeastern University
miller@northeastern.edu
Ken Q. Pu
UOIT
ken.pu@uoit.ca
Patricia C. Arocena
TD Bank Group
prg@cs.toronto.edu
ABSTRACT
The ubiquity of data lakes has created fascinating new challenges for data management research. In this tutorial, we review the state-of-the-art in data management for data lakes. We consider how data lakes are introducing new problems including dataset discovery and how they are changing the requirements for classic problems including data extraction, data cleaning, data integration, data versioning, and metadata management.
PVLDB Reference Format:
Fatemeh Nargesian, Erkang Zhu, Renée J. Miller, Ken Q. Pu,
Patricia C. Arocena. Data Lake Management: Challenges and
Opportunities. PVLDB, 12(12): 1986-1989, 2019.
DOI: https://doi.org/10.14778/3352063.3352116
1. INTRODUCTION
A data lake is a massive collection of datasets that: (1) may be hosted in different storage systems; (2) may vary in their formats; (3) may not be accompanied by any useful metadata or may use different formats to describe their metadata; and (4) may change autonomously over time. Enterprises have embraced data lakes for a variety of reasons. First, data lakes decouple data producers (for example, operational systems) from data consumers (such as reporting and predictive analytics systems). This is important, especially when the operational systems are legacy mainframes which may not even be owned by the enterprise (as is common in industries such as banking and finance). For data science, data lakes provide a convenient storage layer for experimental data, both the input and output of data analysis and learning tasks. The creation and use of data can be done autonomously without coordination with other programs or analysts. But the shared storage of a data lake, coupled with a (typically distributed) computational framework, provides the rudimentary infrastructure required for sharing and re-use of massive datasets.
While some of the data in a lake is extracted, transformed, and loaded into existing database management systems (DBMS) or data warehouses, some of it may be exclusively consumed on-demand by programming environments to perform specific data analysis tasks. Moreover, the value of some of this data is transient, meaning additional analysis is required to create information with sufficient value to load into a data warehouse. Even though some of this data is not destined for a traditional DBMS, there are still many open and fascinating data management research problems.
Current data lakes provide reliable storage for datasets together with computational frameworks (such as Hadoop or Apache Spark), along with a suite of tools for doing data governance (including identity management, authentication, and access control), data discovery, extraction, cleaning, and integration. These tools help individual teams, data owners and consumers alike, to create and use data in a data lake using a self-serve model. But many challenges remain. First, we are only at the beginning of being able to exploit the work of others (their search, extraction, cleaning, and integration effort) to help in new uses of a data lake. Systems like IBM's LabBook propose using the collective effort of data scientists to recommend new data visualization or analysis actions over new datasets or for new users [25]. Still, challenges and opportunities remain in being able to collectively exploit how data lakes are used. Second, data lakes are currently mostly intermediate repositories for data. Currently, this data does not become actionable until it is cleaned and integrated into a traditional DBMS or warehouse. A grand challenge for data lake management systems is to support on-demand query answering, meaning data discovery, extraction, cleaning, and integration done at query time over massive collections of datasets that may have unknown structure, content, and data quality. Only then would the data in data lakes become actionable.
2. DATA LAKE ARCHITECTURE
Figure 1 shows a high-level view of a common data lake. The data sources may include legacy operational systems (operating in COBOL or other formats), information scraped from the Web and social media, or information from for-profit data brokers (such as Thomson Reuters or LexisNexis). Operational systems often export all data as strings to avoid having to deal with type mismatches. The actual type information and metadata may be represented in numerous different formats. Other data may be pure documents, semi-structured logs, or social media information. Data lakes vary in their ability to support a unified view over all or portions of the lake.
[Figure 1: Example Data Lake Management System. Data sources in formats such as JSON, CSV, and Parquet flow through ingestion and extraction into the data lake, where a unified data model, metadata, and data and metadata indices support cleaning, versioning, and discovery and integration.]
3. CHALLENGES AND OPPORTUNITIES
Our tutorial will be focused on the following main topics. For each, we will discuss the state-of-the-art and also present a vision for open problems that need to be studied.
3.1 Data Ingestion
Modern data lakes support data ingestion from a large variety of systems. Examples of ingestion are Web crawlers that create data files containing web pages and Open Data crawlers that archive Open Data repositories using Open Data endpoints. Enterprises usually develop proprietary software to handle end-to-end data ingestion.
The main task at this stage is bookkeeping of files for versioning and indexing purposes. Since ingestion often needs to interface with external data sources with limited bandwidth, it needs to be done with a high degree of parallelism and low latency. This means that ingestion does not perform any deep analysis of the downloaded data. However, it is still possible to apply shallow data sketches to the downloaded contents and their metadata (if available) to maintain a basic organization of the ingested datasets. Simple data sketches (such as checksums) can also be used for duplicate detection and multi-versioning of evolving datasets. Open challenges in data ingestion are to support real-time ingestion of high-velocity data with more sophisticated indexing to make this data more immediately available for analysis.
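As an illustration of the kind of shallow sketching that can be applied at ingestion time, the following is a minimal sketch (not from the paper) that computes content checksums for ingested files and uses them to detect exact duplicates; the directory layout in the usage comment is assumed for the example.

    import hashlib
    from pathlib import Path
    from collections import defaultdict

    def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
        """Stream a file and return its SHA-256 hex digest (a simple content sketch)."""
        h = hashlib.sha256()
        with path.open("rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
        return h.hexdigest()

    def group_duplicates(lake_root: str) -> dict[str, list[Path]]:
        """Map each checksum to the ingested files that share it (exact duplicates)."""
        groups: dict[str, list[Path]] = defaultdict(list)
        for path in Path(lake_root).rglob("*"):
            if path.is_file():
                groups[sha256_of(path)].append(path)
        return {digest: paths for digest, paths in groups.items() if len(paths) > 1}

    # Example usage (hypothetical lake directory):
    # for digest, paths in group_duplicates("/data/lake/raw").items():
    #     print(digest[:12], [str(p) for p in paths])

A checksum only detects byte-identical copies; detecting near-duplicate or evolving versions requires richer sketches, which is part of the open challenge described above.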
3.2 Data Extraction
Data ingestion creates raw datasets in a specific data format (e.g., textual or binary encodings). Data extraction is the task of transforming this raw data into a predetermined data model. This abstraction from raw data to a data model may be intertwined with preparation for tasks such as discovery, integration, and cleaning. For example, CLAMS unifies heterogeneous lake data into RDF for cleaning purposes [14]. Table extraction allows the abstraction of data into attributes (sets of values) that can be indexed for efficient data discovery [43].
An example of a current extractor is DeepDive, which extracts relational data from lakes of text, tables, and images by relying on a user-defined schema and a small number of rules [41]. The automatic extraction of tables from web pages has been very well studied for over 10 years [11, 19, 27]. An example of large-scale table extraction is the Google Web Table project, which combines hand-written heuristics and statistically-trained classifiers to detect relational tables among HTML tables and assigns synthetic headers when required [5]. Another project leverages the principles of table construction to extract data tables with more complex structures, such as tables that contain group headers [1]. The recent DATAMARAN project extracts relational data from semi-structured log files [18]. Table extraction from ad hoc spreadsheets remains a challenge [7]. Contributions to extraction from the programming language community include PADS, a declarative data description language, together with a compiler and tools for parsing and extracting data from files [17, 45].
Data extraction is very well studied, yet opportunities remain for advancement. Today, extraction is typically done one file at a time. We are not yet taking full advantage of the "wisdom of the lake" to fully apply knowledge learned from previous extractions (and from humans in the extraction loop) to future extractions.
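To make the raw-to-relational step concrete, here is a minimal, hand-written extractor (not the DATAMARAN or DeepDive approach, just an illustrative sketch) that turns semi-structured log lines into rows of a small relational table using a regular expression; the log format shown is an assumption made for the example.

    import re
    from typing import Optional

    # Assumed log format: "2019-07-01 12:30:05 GET /datasets/42 200"
    LOG_PATTERN = re.compile(
        r"(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) "
        r"(?P<method>[A-Z]+) (?P<path>\S+) (?P<status>\d{3})"
    )

    def extract_row(line: str) -> Optional[dict]:
        """Map one raw log line to a relational tuple, or None if it does not parse."""
        match = LOG_PATTERN.match(line.strip())
        return match.groupdict() if match else None

    lines = [
        "2019-07-01 12:30:05 GET /datasets/42 200",
        "2019-07-01 12:30:09 POST /datasets 201",
        "malformed entry that the extractor skips",
    ]
    table = [row for row in map(extract_row, lines) if row is not None]
    print(table)  # two rows with attributes date, time, method, path, status

The extracted attributes (date, method, status, and so on) are exactly the kind of value sets that can then be indexed for discovery, as discussed above.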
3.3 Data Cleaning
While cleaning enterprise data has received significant attention over the years, little work has been done on cleaning within the context of a data lake. Logical and relational data cleaning typically requires correct schema information including integrity constraints [39]. However, in data lakes the data may be stored in schema-less heterogeneous formats or using schemas that are only specified at the application level. Although enriching data with schema information is one of the main goals of metadata management systems, postponing cleaning to the later stages of data processing may result in the propagation of errors through operations such as discovery and integration. CLAMS is an early approach that explicitly addresses the problem of cleaning raw heterogeneous lake data [14]. It enforces quality constraints on the data right after ingestion and extraction. CLAMS loads heterogeneous raw data sources into a unified data model and enforces quality constraints on this model. An interesting opportunity in lake data cleaning is leveraging the lake's wisdom and performing collective data cleaning. Furthermore, since data lake operations such as extraction can themselves introduce systematic errors into the lake, it is important to investigate the underlying conditions and operations that cause these errors [38].
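As a small illustration of enforcing quality constraints right after extraction (in the spirit of, but much simpler than, CLAMS), the following sketch checks declarative constraints over extracted rows; the constraints and field names are assumptions chosen to match the log-extraction example above.

    from typing import Callable

    # Each constraint is a named predicate over a row; rows violating any predicate are flagged.
    CONSTRAINTS: dict[str, Callable[[dict], bool]] = {
        "status_is_valid_http_code": lambda r: 100 <= int(r["status"]) <= 599,
        "method_is_known": lambda r: r["method"] in {"GET", "POST", "PUT", "DELETE"},
        "path_is_absolute": lambda r: r["path"].startswith("/"),
    }

    def violations(rows: list[dict]) -> list[tuple[int, str]]:
        """Return (row index, constraint name) pairs for every violated constraint."""
        found = []
        for i, row in enumerate(rows):
            for name, predicate in CONSTRAINTS.items():
                if not predicate(row):
                    found.append((i, name))
        return found

    rows = [
        {"method": "GET", "path": "/datasets/42", "status": "200"},
        {"method": "FETCH", "path": "datasets", "status": "999"},  # violates all three
    ]
    print(violations(rows))

Checking such constraints early keeps errors from propagating into downstream discovery and integration, which is the point the paragraph above makes.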
3.4 Dataset Discovery
Due to the sheer size of data in data lakes and the absence or incompleteness of a comprehensive schema or data catalog, data discovery has become an important problem in data lakes. To address the data discovery problem, some solutions focus on generating and enriching data catalogs as well as facilitating search on them. We consider these below with other data lake metadata management techniques. Other solutions operate on raw data (and existing metadata) to perform discovery [9, 29, 43]. In query-driven discovery, a user starts a search with a query (dataset or keywords) and the goal is to find datasets similar to the query [4, 6] or datasets that can be integrated (joined or unioned) with the query [31, 43]. This is achieved by defining measures and constructing efficient index structures that are specialized for the unique characteristics of data lakes. Navigation (or exploration) is an alternative to search. Data discovery can be done by allowing a user to navigate over a linkage graph [44, 42, 16, 15] or a hierarchical structure created to facilitate exploration of the lake [30]. An interesting direction in discovery is analysis-driven discovery, which is the problem of augmenting a dataset with relevant data (new training samples and features) with the purpose of performing learning tasks.
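The query-driven search for joinable or unionable tables typically relies on set-similarity measures such as Jaccard similarity or containment, estimated with sketches like MinHash so that indexing can scale to a lake (this is the idea behind LSH Ensemble [43] and JOSIE [42]). The following self-contained MinHash sketch estimates Jaccard similarity between two column value sets; it illustrates the underlying measure only, not the indexing structures from those papers, and the column values are invented for the example.

    import hashlib

    def minhash_signature(values: set[str], num_perm: int = 128) -> list[int]:
        """One minimum hash value per seed; agreement across signatures estimates Jaccard."""
        return [
            min(int.from_bytes(hashlib.sha1(f"{seed}:{v}".encode()).digest()[:8], "big")
                for v in values)
            for seed in range(num_perm)
        ]

    def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
        """Fraction of matching positions approximates the Jaccard similarity of the two sets."""
        return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

    # Two hypothetical column value sets drawn from different datasets in the lake.
    cities_a = {"toronto", "oshawa", "boston", "new york", "chicago"}
    cities_b = {"toronto", "boston", "chicago", "montreal"}
    sim = estimated_jaccard(minhash_signature(cities_a), minhash_signature(cities_b))
    print(round(sim, 2))  # should be near the true Jaccard of 3/6 = 0.5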
3.5 Metadata Management
Unlike data warehouses or DBMS, data lakes may not be accompanied by descriptive and complete data catalogs. Without explicit metadata information, a data lake can easily become a data swamp. Data catalogs are essential to on-demand discovery and integration in data lakes as well as raw data cleaning. In addition to extracting metadata from sources and enriching data with meaningful metadata (such as detailed data descriptions and integrity constraints), metadata management systems need to support efficient storage of metadata (especially when it becomes large) and query answering over metadata.
An example of a metadata management system is Google Dataset Search (GOODS), which extracts and collects metadata for datasets generated and used internally by Google [21]. The collected metadata ranges from dataset-specific information, such as owners, timestamps, and schema, to relationships among multiple datasets, such as their similarity and provenance. GOODS makes datasets accessible and searchable by exposing their collected metadata in dataset profiles. Constance is another example that, in addition to extracting metadata from sources, enriches data sources by annotating data and metadata with semantic information [20]. Constance makes the generated metadata accessible in a template-based query answering environment. In contrast, the Ground project collects the context of data, which includes applications, behaviors, and changes of data, and stores the metadata in queryable graph structures [23]. Skluma extracts deeply embedded metadata, latent topics, and contextual metadata from files in various formats in a data lake, and allows topic-based discovery [36].
Metadata discovery provides the data abstraction that is crucial to data understanding and discovery, yet opportunities remain in better extracting and connecting knowledge from lakes and incorporating this knowledge into existing (general or domain-specific) knowledge bases.
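To ground the notion of a dataset profile, the following is a minimal sketch of a catalog entry such systems might maintain; the fields are assumptions chosen for illustration and do not reproduce the actual GOODS or Constance schemas.

    from dataclasses import dataclass, field

    @dataclass
    class DatasetProfile:
        """A minimal metadata record for one dataset in the lake."""
        path: str                      # where the dataset lives in lake storage
        fmt: str                       # e.g. "csv", "json", "parquet"
        owner: str                     # team or person responsible for the data
        created_at: str                # ISO timestamp of ingestion
        schema: dict[str, str] = field(default_factory=dict)  # attribute name -> inferred type
        provenance: list[str] = field(default_factory=list)   # paths of upstream datasets

    profile = DatasetProfile(
        path="/lake/raw/web_requests/2019-07-01.csv",
        fmt="csv",
        owner="analytics-team",
        created_at="2019-07-01T13:00:00Z",
        schema={"date": "date", "method": "string", "status": "int"},
        provenance=["/lake/raw/server_logs/2019-07-01.log"],
    )
    print(profile.schema)

Profiles like this are what a catalog exposes for search, and the schema and provenance fields are what downstream cleaning and integration would consume.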
3.6 Data Integration
Traditional paradigms for integration, including data federation [22] and data exchange [13], have at best limited value in data lakes. We will survey some Big Data Integration techniques that tackle dynamic data, which may be of very poor data quality [10], to consider how they apply to data lakes. These techniques differ from pay-as-you-go data integration, which automatically constructs a mediated schema from various sources [8]. We will discuss the requirements of on-demand integration, that is, the task of integrating raw data from a data lake at query time. The challenge of on-demand integration lies in first finding datasets that contain relevant data, and then integrating them in a meaningful way. Relevant data may be modeled as data that augments known entities with new attributes or properties, as done in Infogather [40]. Alternatively, relevant data may be a schema that is described by keyword queries expressed over attribute names or other metadata [32, 34]. Schema mapping permits the exchange of information between datasets using different schemas [12, 37], and recent work permits mapping discovery over incomplete (or inconsistent) schemas and examples [26]. In sample-driven schema mapping, users describe the schema using a set of tuples [33, 35]. To give users flexibility in describing a schema, in multi-resolution schema mapping the user can describe schemas using a set of constraints of various resolutions, such as incomplete tuples, value ranges, and data types [24]. Nonetheless, on-demand schema mapping remains a grand challenge. Importantly, discovery and integration are intertwined operations in data lakes [34, 43, 31]. A new paradigm, called query-driven discovery, finds tables that join or union with a query table [43, 31]. Most of these solutions perform integration on relational data. However, to achieve on-demand data integration on data lakes, we must be able to manage the heterogeneity of lakes and potentially perform on-demand extraction and cleaning as part of integration.
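As a toy illustration of the integration step once relevant tables have been found, the sketch below unions two tables by aligning attributes with identical names and padding missing values with nulls; real table union search [31] aligns attributes by value and semantic similarity rather than exact name matching, so this is only the simplest possible baseline, with invented example data.

    def union_by_attribute_name(table_a: list[dict], table_b: list[dict]) -> list[dict]:
        """Union two tables (lists of rows) over the union of their attribute names."""
        attributes = sorted({k for row in table_a + table_b for k in row})
        return [{attr: row.get(attr) for attr in attributes} for row in table_a + table_b]

    parks_toronto = [{"name": "High Park", "city": "Toronto", "area_ha": 161}]
    parks_boston = [{"name": "Boston Common", "city": "Boston", "established": 1634}]
    for row in union_by_attribute_name(parks_toronto, parks_boston):
        print(row)
    # Attributes not present in one source appear as None in its rows.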
3.7 Dataset Versioning
Data lakes are dynamic. New datasets and new versions of existing files enter the lake at the ingestion stage. Additionally, extractors can evolve over time and generate new versions of raw data. As a result, data lake versioning is a cross-cutting concern over all stages of a data lake. Of course, vanilla distributed file systems are not adequate for versioning-related operations. For example, simply storing all versions may be too costly for large datasets, and without a good version manager, just using filenames to track versions can be error-prone. In a data lake where there are usually many users, it is even more important to clearly maintain correct versions and versioning information. Furthermore, as the number of versions increases, efficiently and cost-effectively providing storage and retrieval of versions is going to be an important feature of a successful data lake system. One early approach, DataHub, provides a git-like interface by supporting operations such as version creation, branching, merging, and viewing differences between datasets [2, 3]. Open challenges include managing schema evolution and the peculiarities of data formats between versions, as well as the detection of versions.
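A git-like versioning interface over datasets can be sketched, at its simplest, as content-addressed storage: each committed version is identified by the hash of its contents, and a per-dataset log records the chain of versions. The class below is a minimal illustration of that idea (an assumption for this summary, not DataHub's actual design), without the branching, merging, or storage/recreation trade-offs studied in [2, 3].

    import hashlib
    from collections import defaultdict

    class TinyVersionStore:
        """Content-addressed dataset versions with a per-dataset linear history."""

        def __init__(self) -> None:
            self.objects: dict[str, bytes] = {}                      # version id -> raw bytes
            self.history: dict[str, list[str]] = defaultdict(list)   # dataset name -> version ids

        def commit(self, dataset: str, content: bytes) -> str:
            """Store a new version; identical content yields the same version id (deduplicated)."""
            version_id = hashlib.sha256(content).hexdigest()
            self.objects.setdefault(version_id, content)
            if not self.history[dataset] or self.history[dataset][-1] != version_id:
                self.history[dataset].append(version_id)
            return version_id

        def checkout(self, dataset: str, version_id: str | None = None) -> bytes:
            """Return a specific version, or the latest one if no id is given."""
            vid = version_id or self.history[dataset][-1]
            return self.objects[vid]

    store = TinyVersionStore()
    v1 = store.commit("parks.csv", b"name,city\nHigh Park,Toronto\n")
    v2 = store.commit("parks.csv", b"name,city\nHigh Park,Toronto\nBoston Common,Boston\n")
    print(len(store.history["parks.csv"]), store.checkout("parks.csv") == store.objects[v2])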
4. OUTLINE AND SCOPE
We will focus on the challenges and open problems in data lake management and the state-of-the-art techniques in the areas we described in Section 3. The tutorial is designed for a data management and data science audience.
5. BIO SKETCHES
For five years, Nargesian, Pu, Zhu, and Miller have been
studying data lakes and developing new data discovery
paradigms. Their results were featured in a VLDB 2018
keynote [28]. Arocena is a Big Data practitioner with a PhD
in Data Integration and several years of industry experience
with data lakes. She currently works at TD Bank.
6. REFERENCES
[1] M. D. Adelfio and H. Samet. Schema extraction for tabular
data on the web. PVLDB, 6(6):421–432, 2013.
[2] A. P. Bhardwaj, S. Bhattacherjee, A. Chavan,
A. Deshpande, A. J. Elmore, S. Madden, and A. G.
Parameswaran. DataHub: Collaborative data science &
dataset version management at scale. In CIDR, 2015.
[3] S. Bhattacherjee, A. Chavan, S. Huang, A. Deshpande, and
A. Parameswaran. Principles of dataset versioning:
Exploring the recreation/storage tradeoff. PVLDB,
8(12):1346–1357, 2015.
[4] W. Brackenbury, R. Liu, M. Mondal, A. J. Elmore, B. Ur,
K. Chard, and M. J. Franklin. Draining the data swamp: A
similarity-based approach. HILDA, pages 13:1–13:7, 2018.
[5] M. J. Cafarella, A. Halevy, D. Z. Wang, E. Wu, and
Y. Zhang. Webtables: Exploring the power of tables on the
web. PVLDB, 1(1):538–549, 2008.
[6] M. J. Cafarella, A. Y. Halevy, and N. Khoussainova. Data
integration for the relational web. PVLDB, 2(1):1090–1101,
2009.
[7] Z. Chen, S. Dadiomov, R. Wesley, G. Xiao, D. Cory, M. J.
Cafarella, and J. Mackinlay. Spreadsheet property detection
with rule-assisted active learning. In CIKM, pages
999–1008, 2017.
[8] A. Das Sarma, X. Dong, and A. Halevy. Bootstrapping
pay-as-you-go data integration systems. In SIGMOD, pages
861–874, 2008.
[9] D. Deng, R. C. Fernandez, Z. Abedjan, S. Wang,
M. Stonebraker, A. K. Elmagarmid, I. F. Ilyas, S. Madden,
M. Ouzzani, and N. Tang. The data civilizer system. In
CIDR, 2017.
[10] X. L. Dong and D. Srivastava. Big Data Integration.
Synthesis Lectures on Data Management. 2015.
[11] J. Eberius, K. Braunschweig, M. Hentsch, M. Thiele,
A. Ahmadov, and W. Lehner. Building the Dresden web
table corpus: A classification approach. In Symposium on
Big Data Computing, pages 41–50, 2015.
[12] R. Fagin, L. M. Haas, M. A. Hernández, R. J. Miller,
L. Popa, and Y. Velegrakis. Clio: Schema mapping creation
and data exchange. In Conceptual Modeling: Foundations
and Applications - Essays in Honor of John Mylopoulos,
pages 198–236, 2009.
[13] R. Fagin, P. G. Kolaitis, R. J. Miller, and L. Popa. Data
exchange: semantics and query answering. Theoretical
Computer Science, 336(1):89–124, 2005.
[14] M. H. Farid, A. Roatis, I. F. Ilyas, H. Hoffmann, and
X. Chu. CLAMS: bringing quality to data lakes. In
SIGMOD, pages 2089–2092, 2016.
[15] R. C. Fernandez, Z. Abedjan, F. Koko, G. Yuan,
S. Madden, and M. Stonebraker. Aurum: A data discovery
system. In ICDE, pages 1001–1012, 2018.
[16] R. C. Fernandez, E. Mansour, A. A. Qahtan, A. K.
Elmagarmid, I. F. Ilyas, S. Madden, M. Ouzzani,
M. Stonebraker, and N. Tang. Seeping semantics: Linking
datasets using word embeddings for data discovery. In
ICDE, pages 989–1000, 2018.
[17] K. Fisher and D. Walker. The PADS project: an overview.
In ICDT, pages 11–17, 2011.
[18] Y. Gao, S. Huang, and A. Parameswaran. Navigating the
data lake with datamaran: Automatically extracting
structure from log datasets. In SIGMOD, pages 943–958,
2018.
[19] W. Gatterbauer and P. Bohunsky. Table extraction using
spatial reasoning on the CSS2 visual box model. In AAAI,
pages 1313–1318, 2006.
[20] R. Hai, S. Geisler, and C. Quix. Constance: An intelligent
data lake system. In SIGMOD, pages 2097–2100, 2016.
[21] A. Halevy, F. Korn, N. F. Noy, C. Olston, N. Polyzotis,
S. Roy, and S. E. Whang. Goods: Organizing google’s
datasets. In SIGMOD, pages 795–806, 2016.
[22] D. Heimbigner and D. McLeod. A federated architecture
for information management. ACM Trans. Inf. Syst.,
3(3):253–278, 1985.
[23] J. M. Hellerstein, V. Sreekanti, J. E. Gonzalez, J. Dalton,
A. Dey, S. Nag, K. Ramachandran, S. Arora,
A. Bhattacharyya, S. Das, M. Donsky, G. Fierro, C. She,
C. Steinbach, V. Subramanian, and E. Sun. Ground: A
data context service. In CIDR, 2017.
[24] Z. Jin, C. Baik, M. Cafarella, and H. V. Jagadish. Beaver:
Towards a declarative schema mapping. In HILDA, pages
10:1–10:4, 2018.
[25] E. Kandogan, M. Roth, P. M. Schwarz, J. Hui, I. G.
Terrizzano, C. Christodoulakis, and R. J. Miller. LabBook:
Metadata-driven social collaborative data analysis. In IEEE
Big Data, pages 431–440, 2015.
[26] A. Kimmig, A. Memory, R. J. Miller, and L. Getoor. A
collective, probabilistic approach to schema mapping. In
ICDE, pages 921–932, 2017.
[27] O. Lehmberg, D. Ritze, R. Meusel, and C. Bizer. A large
public corpus of web tables containing time and context
metadata. In WWW, pages 75–76, 2016.
[28] R. J. Miller. Open data integration. PVLDB,
11(12):2130–2139, 2018.
[29] R. J. Miller, F. Nargesian, E. Zhu, C. Christodoulakis,
K. Q. Pu, and P. Andritsos. Making open data transparent:
Data discovery on open data. IEEE Data Eng. Bull.,
41(2):59–70, 2018.
[30] F. Nargesian, K. Q. Pu, E. Zhu, B. G. Bashardoost, and
R. J. Miller. Optimizing organizations for navigating data
lakes, 2018. arXiv:1812.07024.
[31] F. Nargesian, E. Zhu, K. Q. Pu, and R. J. Miller. Table
union search on open data. PVLDB, 11(7):813–825, 2018.
[32] R. Pimplikar and S. Sarawagi. Answering table queries on
the web using column keywords. PVLDB, 5(10):908–919,
2012.
[33] L. Qian, M. J. Cafarella, and H. V. Jagadish. Sample-driven
schema mapping. In SIGMOD, pages 73–84, 2012.
[34] A. D. Sarma, L. Fang, N. Gupta, A. Y. Halevy, H. Lee,
F. Wu, R. Xin, and C. Yu. Finding related tables. In
SIGMOD, pages 817–828, 2012.
[35] Y. Shen, K. Chakrabarti, S. Chaudhuri, B. Ding, and
L. Novik. Discovering queries based on example tuples. In
SIGMOD, pages 493–504, 2014.
[36] T. J. Skluzacek, R. Kumar, R. Chard, G. Harrison,
P. Beckman, K. Chard, and I. T. Foster. Skluma: An
extensible metadata extraction pipeline for disorganized
data. In IEEE International Conference on e-Science,
pages 256–266, 2018.
[37] B. ten Cate, P. G. Kolaitis, and W. C. Tan. Schema
mappings and data examples. In EDBT, pages 777–780,
2013.
[38] X. Wang, M. Feng, Y. Wang, X. L. Dong, and A. Meliou.
Error diagnosis and data profiling with data x-ray. PVLDB,
8(12):1984–1987, 2015.
[39] M. Yakout, A. K. Elmagarmid, J. Neville, M. Ouzzani, and
I. F. Ilyas. Guided data repair. PVLDB, 4(5):279–289, 2011.
[40] M. Yakout, K. Ganjam, K. Chakrabarti, and S. Chaudhuri.
Infogather: Entity augmentation and attribute discovery by
holistic matching with web tables. In SIGMOD, pages
97–108, 2012.
[41] C. Zhang, J. Shin, C. Ré, M. J. Cafarella, and F. Niu.
Extracting databases from dark data with DeepDive. In
SIGMOD, pages 847–859, 2016.
[42] E. Zhu, D. Deng, F. Nargesian, and R. J. Miller. JOSIE: Overlap
set similarity search for finding joinable tables in data
lakes. In SIGMOD, 2019. To appear.
[43] E. Zhu, F. Nargesian, K. Q. Pu, and R. J. Miller. LSH
ensemble: Internet-scale domain search. PVLDB,
9(12):1185–1196, 2016.
[44] E. Zhu, K. Q. Pu, F. Nargesian, and R. J. Miller.
Interactive navigation of open data linkages. PVLDB,
10(12):1837–1840, 2017.
[45] K. Q. Zhu, K. Fisher, and D. Walker. LearnPADS++:
Incremental inference of ad hoc data formats. In PADL,
pages 168–182, 2012.
