Time Series Management Systems: A Survey Jensen, Søren Kejser; Pedersen, Torben Bach; Thomsen, Christian
IEEE transactions on knowledge and data engineering,
11/2017, Letnik:
29, Številka:
11
Journal Article
Recenzirano
Odprti dostop
The collection of time series data increases as more monitoring and automation are being deployed. These deployments range in scale from an Internet of things (IoT) device located in a household to ...enormous distributed Cyber-Physical Systems (CPSs) producing large volumes of data at high velocity. To store and analyze these vast amounts of data, specialized Time Series Management Systems (TSMSs) have been developed to overcome the limitations of general purpose Database Management Systems (DBMSs) for times series management. In this paper, we present a thorough analysis and classification of TSMSs developed through academic or industrial research and documented through publications. Our classification is organized into categories based on the architectures observed during our analysis. In addition, we provide an overview of each system with a focus on the motivational use case that drove the development of the system, the functionality for storage and querying of time series a system implements, the components the system is composed of, and the capabilities of each system with regard to Stream Processing and Approximate Query Processing (AQP). Last, we provide a summary of research directions proposed by other researchers in the field and present our vision for a next generation TSMS.
This open access book explores the dataspace paradigm as a best-effort approach to data management within data ecosystems. It establishes the theoretical foundations and principles of real-time ...linked dataspaces as a data platform for intelligent systems. The book introduces a set of specialized best-effort techniques and models to enable loose administrative proximity and semantic integration for managing and processing events and streams. The book is divided into five major parts: Part I “Fundamentals and Concepts” details the motivation behind and core concepts of real-time linked dataspaces, and establishes the need to evolve data management techniques in order to meet the challenges of enabling data ecosystems for intelligent systems within smart environments. Further, it explains the fundamental concepts of dataspaces and the need for specialization in the processing of dynamic real-time data. Part II “Data Support Services” explores the design and evaluation of critical services, including catalog, entity management, query and search, data service discovery, and human-in-the-loop. In turn, Part III “Stream and Event Processing Services” addresses the design and evaluation of the specialized techniques created for real-time support services including complex event processing, event service composition, stream dissemination, stream matching, and approximate semantic matching. Part IV “Intelligent Systems and Applications” explores the use of real-time linked dataspaces within real-world smart environments. In closing, Part V “Future Directions” outlines future research challenges for dataspaces, data ecosystems, and intelligent systems. Readers will gain a detailed understanding of how the dataspace paradigm is now being used to enable data ecosystems for intelligent systems within smart environments. The book covers the fundamental theory, the creation of new techniques needed for support services, and lessons learned from real-world intelligent systems and applications focused on sustainability. Accordingly, it will benefit not only researchers and graduate students in the fields of data management, big data, and IoT, but also professionals who need to create advanced data management platforms for intelligent systems, smart environments, and data ecosystems.
This open access book is a step-by-step introduction on how shell scripting can help solve many of the data processing tasks that Health and Life specialists face everyday with minimal software ...dependencies. The examples presented in the book show how simple command line tools can be used and combined to retrieve data and text from web resources, to filter and mine literature, and to explore the semantics encoded in biomedical ontologies. To store data this book relies on open standard text file formats, such as TSV, CSV, XML, and OWL, that can be open by any text editor or spreadsheet application. The first two chapters, Introduction and Resources, provide a brief introduction to the shell scripting and describe popular data resources in Health and Life Sciences. The third chapter, Data Retrieval, starts by introducing a common data processing task that involves multiple data resources. Then, this chapter explains how to automate each step of that task by introducing the required commands line tools one by one. The fourth chapter, Text Processing, shows how to filter and analyze text by using simple string matching techniques and regular expressions. The last chapter, Semantic Processing, shows how XPath queries and shell scripting is able to process complex data, such as the graphs used to specify ontologies. Besides being almost immutable for more than four decades and being available in most of our personal computers, shell scripting is relatively easy to learn by Health and Life specialists as a sequence of independent commands. Comprehending them is like conducting a new laboratory protocol by testing and understanding its procedural steps and variables, and combining their intermediate results. Thus, this book is particularly relevant to Health and Life specialists or students that want to easily learn how to process data and text, and which in return may facilitate and inspire them to acquire deeper bioinformatics skills in the future.
Designing Data Spaces Otto, Boris; ten Hompel, Michael; Wrobel, Stefan
2022, 2022-07-21
eBook
Odprti dostop
This open access book provides a comprehensive view on data ecosystems and platform economics from methodical and technological foundations up to reports from practical implementations and ...applications in various industries. To this end, the book is structured in four parts: Part I “Foundations and Contexts” provides a general overview about building, running, and governing data spaces and an introduction to the IDS and GAIA-X projects. Part II “Data Space Technologies” subsequently details various implementation aspects of IDS and GAIA-X, including eg data usage control, the usage of blockchain technologies, or semantic data integration and interoperability. Next, Part III describes various “Use Cases and Data Ecosystems” from various application areas such as agriculture, healthcare, industry, energy, and mobility. Part IV eventually offers an overview of several “Solutions and Applications”, eg including products and experiences from companies like Google, SAP, Huawei, T-Systems, Innopay and many more. Overall, the book provides professionals in industry with an encompassing overview of the technological and economic aspects of data spaces, based on the International Data Spaces and Gaia-X initiatives. It presents implementations and business cases and gives an outlook to future developments. In doing so, it aims at proliferating the vision of a social data market economy based on data spaces which embrace trust and data sovereignty.
Amazon Aurora Verbitski, Alexandre; Gupta, Anurag; Saha, Debanjan ...
Proceedings of the 2018 International Conference on Management of Data,
05/2018
Conference Proceeding
Amazon Aurora is a high-throughput cloud-native relational database offered as part of Amazon Web Services (AWS). One of the more novel differences between Aurora and other relational databases is ...how it pushes redo processing to a multi-tenant scale-out storage service, purpose-built for Aurora. Doing so reduces networking traffic, avoids checkpoints and crash recovery, enables failovers to replicas without loss of data, and enables fault-tolerant storage that heals without database involvement. Traditional implementations that leverage distributed storage would use distributed consensus algorithms for commits, reads, replication, and membership changes and amplify cost of underlying storage. In this paper, we describe how Aurora avoids distributed consensus under most circumstances by establishing invariants and leveraging local transient state. Doing so improves performance, reduces variability, and lowers costs.
Abstract
FlyBase provides a centralized resource for the genetic and genomic data of Drosophila melanogaster. As FlyBase enters our fourth decade of service to the research community, we reflect on ...our unique aspects and look forward to our continued collaboration with the larger research and model organism communities. In this study, we emphasize the dedicated reports and tools we have constructed to meet the specialized needs of fly researchers but also to facilitate use by other research communities. We also highlight ways that we support the fly community, including an external resources page, help resources, and multiple avenues by which researchers can interact with FlyBase.
The only how-to guide offering a unified, systemic approach to acquiring, cleaning, and managing data in REvery experienced practitioner knows that preparing data for modeling is a painstaking, ...time-consuming process. Adding to the difficulty is that most modelers learn the steps involved in cleaning and managing data piecemeal, often on the fly, or they develop their own ad hoc methods. This book helps simplify their task by providing a unified, systematic approach to acquiring, modeling, manipulating, cleaning, and maintaining data in R. Starting with the very basics, data scientists Samuel E. Buttrey and Lyn R. Whitaker walk readers through the entire process. From what data looks like and what it should look like, they progress through all the steps involved in getting data ready for modeling. They describe best practices for acquiring data from numerous sources; explore key issues in data handling, including text/regular expressions, big data, parallel processing, merging, matching, and checking for duplicates; and outline highly efficient and reliable techniques for documenting data and recordkeeping, including audit trails, getting data back out of R, and more.The only single-source guide to R data and its preparation, it describes best practices for acquiring, manipulating, cleaning, and maintaining dataBegins with the basics and walks readers through all the steps necessary to get data ready for the modeling processProvides expert guidance on how to document the processes described so that they are reproducibleWritten by seasoned professionals, it provides both introductory and advanced techniquesFeatures case studies with supporting data and R code, hosted on a companion websiteA Data Scientist's Guide to Acquiring, Cleaning and Managing Data in R is a valuable working resource/bench manual for practitioners who collect and analyze data, lab scientists and research associates of all levels of experience, and graduate-level data mining students.
The variety of data is one of the most challenging issues for the research and practice in data management systems. The data are naturally organized in different formats and models, including ...structured data, semi-structured data, and unstructured data. In this survey, we introduce the area of multi-model DBMSs that build a single database platform to manage multi-model data. Even though multi-model databases are a newly emerging area, in recent years, we have witnessed many database systems to embrace this category. We provide a general classification and multi-dimensional comparisons for the most popular multi-model databases. This comprehensive introduction on existing approaches and open problems, from the technique and application perspective, make this survey useful for motivating new multi-model database approaches, as well as serving as a technical reference for developing multi-model database applications.
Principles of Big Data helps readers avoid the common mistakes that endanger all Big Data projects. By stressing simple, fundamental concepts, this book teaches readers how to organize large volumes ...of complex data, and how to achieve data permanence when the content of the data is constantly changing. General methods for data verification and validation, as specifically applied to Big Data resources, are stressed throughout the book. The book demonstrates how adept analysts can find relationships among data objects held in disparate Big Data resources, when the data objects are endowed with semantic support (i.e., organized in classes of uniquely identified data objects). Readers will learn how their data can be integrated with data from other resources, and how the data extracted from Big Data resources can be used for purposes beyond those imagined by the data creators. * Learn general methods for specifying Big Data in a way that is understandable to humans and to computers * Avoid the pitfalls in Big Data design and analysis * Understand how to create and use Big Data safely and responsibly with a set of laws, regulations and ethical standards that apply to the acquisition, distribution and integration of Big Data resources
From Theory to Practice Chu, Shumo; Balazinska, Magdalena; Suciu, Dan
Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data,
05/2015
Conference Proceeding
Big data analytics often requires processing complex queries using massive parallelism, where the main performance metrics is the communication cost incurred during data reshuffling. In this paper, ...we describe a system that can compute efficiently complex join queries, including queries with cyclic joins, on a massively parallel architecture. We build on two independent lines of work for multi-join query evaluation: a communication-optimal algorithm for distributed evaluation, and a worst-case optimal algorithm for sequential evaluation. We evaluate these algorithms together, then describe novel, practical optimizations for both algorithms.