Managing Data in Motion
Data integration has been the information systems profession’s most enduring challenge.
It is almost four decades since Richard Nolan nominated data administration
as the penultimate stage of his data processing maturity model, recognizing that
the development of applications to support business processes would, unless
properly managed, create masses of duplicated and uncoordinated data.
In the early days of database technology, some of us had a dream that we
could achieve Nolan’s objective by building all of our organizations’ databases in
a coordinated manner to eliminate data duplication: “Capture data once, store it
in one place, and make it available to everyone who needs it” was the mantra.
Decentralized computing, packaged software, and plain old self-interest put an
end to that dream, but in many organizations the underlying ideas lived on in the
form of data management initiatives based on planning and coordination of
databases, notably in the form of enterprise data models. Their success was limited,
and organizations turned to tactical solutions to solve the most pressing problems.
They built interfaces to transfer data between applications rather than capturing it
multiple times, and they pulled it together for reporting purposes in what became
data warehouses and marts. This pragmatic approach embodied a willingness to
accept duplicated data as a given that was not attractive to the purists.
The tension between a strategic, organization-wide approach based on the
disposition of data and after-the-fact spot solutions remains today. But the scale of
the problem has grown beyond anything envisaged in the 1970s.
We have witnessed extraordinary advances in computing power, storage
technology, and development tools. Information technology has become ubiquitous in
business and government, and even midsized organizations count their
applications in the thousands and their data in petabytes. But each new application, each
new solution, adds to the proliferation of data. Increasingly, these solutions are
“off the shelf,” offering the buyer little say in the database design and how it
overlaps with existing and future purchases.
Not only has the number of applications exploded, but the complexity of the
data within them is worlds away from the simple structures of early files and
databases. The Internet and smartphones generate enormous volumes of less
structured data; “data” now embraces documents, audio, and video; and cloud
computing both extends the boundary of the organization’s data and further facilitates
acquisition of new applications.
The need for data integration has grown proportionately—or more correctly,
disproportionately, as the number of possible point-to-point interfaces grows
roughly with the square of the number of systems. What was once an opportunistic activity is becoming, in
many organizations, the focus of their systems’ development efforts.
The last decade has seen important advances in tools to support data
integration through messaging and virtualization. This book fills a vital gap in providing
an overview of this technology in a form that is accessible to nonspecialists:
planners, managers, and developers. April Reeve brings a rare combination of
business perspective and detailed knowledge from many years of designing,
implementing, and operating applications for organizations as an IT technician,
manager and, more recently, a consultant using the technologies in a variety of different environments.
Perhaps the most important audience will be data managers, in particular those
who have stuck resolutely to the static data management model and its associated
tools. As the management of data in motion comes to represent an increasing
proportion of the information technology budget, it demands strategic attention, and
data managers, with their organization-wide remit, are ideally placed to take
responsibility. The techniques in this book now form the mainstream of data integration
thinking and represent the current best hope of achieving the data administration
goals Nolan articulated so long ago.
Graeme Simsion
Acknowledgements
First of all, I want to acknowledge the contribution of my husband, Tom Reeve,
who said I had to acknowledge him for making me dinner. During the course of
writing this book he made me dinner hundreds of times. Additionally, he put up
with my constant mantra that “I have to write” instead of doing so many other
things such as exercising or cleaning the house.
Of course I want to acknowledge the generosity of time and effort from all the
data management experts who gave me interviews to use in this book:
• Let me start with David Allen, who was my co-presenter at the tutorial we
gave on this subject at the EDW conference in Chicago in 2011 and teaches
me something fascinating about XML and JSON every time I see him,
without him even knowing he is doing it.
• Krish Krishnan provided an abundance of information on his experience with
data integration in data warehousing and set a wonderfully high standard for the expert interviews.
• James Anderson jumped in quickly when I lost my previous data archiving expert,
and it turns out we used to work together, so now we are reconnected.
• It was a pleasure to reconnect also with Adrienne Tannenbaum and get her
perspective on metadata and data integration.
• I’ve always said that the more experienced we are the more we hate our tools,
because we know the limitations, and it was great to get Dave Linthicum to
look back on some of his experiences with enterprise service buses and the
times when he might have had a love/hate relationship with them.
• I was very excited to get an interview with Dagna Gaythorpe, with all her
experience in data modeling and its challenges, who provided a practitioner’s
view on canonical data modeling that I found surprisingly optimistic.
• William McKnight helped me to get a handle on Hadoop, which is a critical
subject for a current book on data integration.
• I met John Haddad when we were both on a Big Data panel at EDW 2012 and
he generously offered to help me with this book, which he did by reading and
providing feedback on some of the early sections as well as a perfect interview on big data.
• Although Mike Ferguson couldn’t provide an interview, attending his
workshop did provide me with a great deal of my understanding of data
virtualization, which is a core concept in big data integration.
Karl Glenn was one of the technical reviewers of the book and I appreciated
his perspective and advice very much. It was amazing to discover someone who
lived on another continent and yet shared much of my understanding and
perspective regarding best practices with data integration.
And one last shout out for my editor, Andrea Dierna, who is amazingly calm in a crisis.
April Reeve has spent the last 25 years working as an enterprise architect and
program manager for large multinational organizations, developing data
strategies and managing development and operation of solutions. April is an expert
in multiple data management disciplines including data conversion, data
warehousing, business intelligence, master data management, data integration, and
data governance. Currently, she is working for EMC2 Consulting as an Advisory
Consultant in the Enterprise Information Management practice.