Collected lectures of lecturer Nguyễn Hữu Đức | Lectures for the course Data Management and Visualization | Đại học Bách Khoa Hà Nội (Hanoi University of Science and Technology)
The document has 674 pages and is intended to help readers review and do well in the upcoming exam.
Course: Data Management and Visualization
University: Đại học Bách Khoa Hà Nội (Hanoi University of Science and Technology)
Chapter 1: Introduction to data management and visualization
How big is big data?
Data science: The 4th paradigm for scientific discovery
Big data in 2008
Big data sources
• E-commerce
• Social networks
• Internet of things
• Data-intensive experiments (bioinformatics, quantum physics, etc.)
Data is the new oil
Big data: the 5 V's
Big data is a term for data sets that are so large or complex that traditional data processing application software is inadequate to deal with them (Wikipedia).
Big data – big value (source: wipro.com)
Introduction to data management
What is Data Management
• Data management is the development and execution of architectures, policies, practices and procedures in order to manage the information lifecycle needs of an enterprise in an effective manner
Poor Data Management
• 94% of companies suffering from a catastrophic data
loss do not survive – 43% never reopen and 51% close
within two years. (University of Texas)
• 7 out of 10 small firms that experience a major data
loss go out of business within a year. (DTI/Price Waterhouse Coopers)
• 50% of all tape backups fail to restore. (Gartner)
• 25% of all PC users suffer from data loss each year. (Gartner)
Why Data Management: Foundation to Advance Science
• Data is a valuable asset – it is expensive and time consuming to collect
• Data should be managed to:
o maximize the effective use and value of data and information assets
o continually improve quality, including data accuracy, integrity, integration, timeliness of data capture and presentation, relevance and usefulness
o ensure appropriate use of data and information
o facilitate data sharing
o ensure sustainability and accessibility in the long term for re-use in science
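As a rough illustration of what "improving quality" can mean in practice, the sketch below computes two simple indicators for a small batch of records: completeness (how many required values are present) and timeliness of data capture. The function names, the `captured_at` field and the sample records are assumptions made for this example; they do not come from the lecture.

```python
from datetime import datetime, timedelta

def completeness(records, fields):
    """Fraction of required field values that are present (not None or empty)."""
    total = len(records) * len(fields)
    if total == 0:
        return 1.0
    present = sum(
        1 for r in records for f in fields if r.get(f) not in (None, "")
    )
    return present / total

def timeliness(records, now, max_age, time_field="captured_at"):
    """Fraction of records captured no longer than max_age before now."""
    if not records:
        return 1.0
    fresh = sum(
        1 for r in records
        if r.get(time_field) is not None and now - r[time_field] <= max_age
    )
    return fresh / len(records)

# Hypothetical sample data: two sensor readings, one with a missing value.
now = datetime(2024, 1, 10)
records = [
    {"site": "A", "temp_c": 21.5, "captured_at": datetime(2024, 1, 9)},
    {"site": "B", "temp_c": None, "captured_at": datetime(2024, 1, 1)},
]

print(completeness(records, ["site", "temp_c"]))     # 3 of 4 values present
print(timeliness(records, now, timedelta(days=2)))   # 1 of 2 records is recent
```

Other dimensions from the list, such as accuracy or relevance, usually need domain knowledge and cannot be reduced to a generic check like this.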
A new image processing technique reveals something not before seen in this Hubble Space Telescope image taken 11 years ago: a faint planet (arrows), the outermost of three discovered with ground-based telescopes last year around the young star HR 8799. (Image: D. Lafrenière et al., Astrophysical Journal Letters)
“Planet hidden in Hubble archives”, Science News (Feb. 27, 2009)
“The first thing it tells you is how valuable maintaining long-term archives can be.
Here is a major discovery that’s been lurking in the data for about 10 years!”
comments Matt Mountain, director of the Space Telescope Science Institute in Baltimore, which operates Hubble.
“The second thing it tells you is having a well calibrated archive is necessary but not
sufficient to make breakthroughs — it also takes a very innovative group of people to
develop very smart extraction routines that can get rid of all the artifacts to reveal the
planet hidden under all that telescope and detector structure.”
Data Management Facilitates Sharing and Re-use…
Where a majority of data end up now…
Imagine if data were more accessible…
Data Life Cycle
[Figure: data life cycle diagram with the stages Plan, Collect, Assure, Describe, Preserve, Discover, Integrate, Analyze]
Planning
• Consider data management before you collect data
• What kind of data will be collected?
• Which methods will be used (sensors, samples, etc.)?
• What data formats/standards are appropriate?
• How will the data be used?
• How will you share the data?
• Will your methods satisfy:
o Funding requirements
o Policies for access, sharing, reuse
o Budget – most of the time this is overlooked!
• Output
o Formal document
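The planning questions above can be kept as a small checklist so that unanswered items stand out before data collection starts. The following is a minimal sketch under that assumption; the `DataManagementPlan` class, its field names and the sample answers are hypothetical and only mirror the bullets on the slide.

```python
from dataclasses import dataclass, fields
from typing import Optional

@dataclass
class DataManagementPlan:
    """One field per planning question; None means not yet answered."""
    data_types: Optional[str] = None             # What kind of data will be collected?
    collection_methods: Optional[str] = None     # Sensors, samples, etc.
    formats_and_standards: Optional[str] = None  # Appropriate formats/standards
    intended_use: Optional[str] = None           # How will the data be used?
    sharing: Optional[str] = None                # How will the data be shared?
    funder_requirements: Optional[str] = None    # Funding requirements
    access_policies: Optional[str] = None        # Policies for access, sharing, reuse
    budget: Optional[str] = None                 # Often overlooked

def unanswered(plan):
    """Return the names of planning questions that have no answer yet."""
    return [f.name for f in fields(plan) if not getattr(plan, f.name)]

# Hypothetical draft plan with only the first three questions answered.
draft = DataManagementPlan(
    data_types="hourly temperature readings",
    collection_methods="field sensors",
    formats_and_standards="CSV with ISO 8601 timestamps",
)

missing = unanswered(draft)
print(missing)
```

A structure like this is only a working aid; the formal document named as the output of planning would still be written from the completed answers.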