Many massive data processing applications today require long, continuous, and uninterrupted data access. Distributed file systems serve as the back-end storage, providing global namespace management and reliability guarantees. With hardware failures and software issues increasing as systems scale, metadata service reliability has become a critical concern, since it directly affects file and directory operations. Existing metadata management mechanisms provide some degree of fault tolerance but remain inadequate: they often suffer from limited system availability, weak state consistency, and high performance overhead, and they lack an effective mechanism for metadata reliability. This paper introduces a novel highly reliable metadata service to address these issues in large-scale file systems. Unlike traditional strategies, the proposed service adopts a new active-standby architecture for fault tolerance and takes a holistic approach to improving file system availability. A new shared storage pool (SSP) is designed for transparent metadata synchronization and replication between active and standby servers. On top of the SSP, a new policy called multiple actives multiple standbys (MAMS) is presented to perform metadata service recovery in case of failures. A new global state recovery strategy and a smart client fault-tolerance mechanism maintain the continuity of the metadata service. We have implemented this highly reliable metadata service in a prototype file system, CFS (Clover file system), and conducted extensive tests to evaluate it. Experimental results confirm that it can significantly improve file system reliability with fast failover under different failure scenarios while having negligible influence on performance.
Compared with the typical reliability designs in the Hadoop Avatar, Hadoop HA, and Boom-FS file systems, the mean time to recovery (MTTR) with the highly reliable metadata service was reduced by 80.23, 65.46, and 28.13 percent, respectively.
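The abstract above names the mechanisms (SSP-based replication, MAMS recovery) without implementation detail, so the following is only a minimal sketch of the general active-standby failover idea: a standby is promoted when the active server's heartbeat goes stale. All class and function names are illustrative; the actual CFS/MAMS protocol is far more involved.

```python
import time

class MetadataServer:
    """Toy model of one metadata server in an active-standby pair.
    Names are hypothetical, not taken from the CFS paper."""

    def __init__(self, name, role):
        self.name = name
        self.role = role                      # "active" or "standby"
        self.last_heartbeat = time.monotonic()

    def heartbeat(self):
        self.last_heartbeat = time.monotonic()

def elect_new_active(servers, timeout=1.0):
    """If the active server's heartbeat is older than `timeout` seconds,
    mark it failed and promote a standby (which, in an SSP-style design,
    would read the shared replicated state). Returns the active server."""
    now = time.monotonic()
    active = next(s for s in servers if s.role == "active")
    if now - active.last_heartbeat > timeout:
        active.role = "failed"
        standby = next(s for s in servers if s.role == "standby")
        standby.role = "active"
        return standby
    return active
```

Because the metadata lives in a shared storage pool rather than on the active server alone, promotion in this style of design requires no state copy at failover time, which is what makes a short MTTR plausible.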
This paper uses visual network analysis (VNA) to conduct an exploratory data analysis of instapoetry, focusing on the use and co-occurrence of hashtags connected to Scandinavian instapoetry. The goal was to reveal and explore some of the networked patterns and processes connected to the production and distribution of instapoetry by using digital methods. Through descriptive measurements of instapoetry metadata and a visual network analysis, this paper identifies characteristics of such patterns. Findings reveal that the Scandinavian instapoetry community is small and Norwegian-dominated, with an established use of semantically close words related to poetry serving as tags to organize the poetry and make it findable. In addition, the hashtags reveal larger popular themes and topics; recurring themes are emotions, interpersonal relations, and mental health. While at one scale these tags say something about the content of the poems, some of them bring instapoetry into other communities and interest spheres on Instagram, with prominent examples being the interest spaces of specific mental illnesses but also, by way of one high-visibility instapoet, the interest sphere of nature photography and Norwegian tourism promotion.
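The core data structure behind hashtag co-occurrence analysis of this kind can be sketched in a few lines: each post contributes an edge between every pair of hashtags it contains, and edge weights count how often two tags appear together. The sample posts below are hypothetical, loosely modeled on the themes the paper names, not real data from the study.

```python
from itertools import combinations
from collections import Counter

def cooccurrence_edges(posts):
    """Build weighted hashtag co-occurrence edges from a list of posts,
    where each post is a set of hashtags. Nodes are hashtags; an edge's
    weight is the number of posts mentioning both tags together."""
    edges = Counter()
    for tags in posts:
        for a, b in combinations(sorted(tags), 2):
            edges[(a, b)] += 1
    return edges

# Hypothetical sample posts (illustrative only).
posts = [
    {"#dikt", "#poesi", "#instapoet"},
    {"#dikt", "#poesi", "#mentalhealth"},
    {"#poesi", "#naturephotography", "#norway"},
]
edges = cooccurrence_edges(posts)
```

A network like this is then typically laid out with a force-directed algorithm in a VNA tool, so that tags that frequently co-occur cluster visually, which is how the interest spheres described above become visible.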
While storing invoice content as metadata to avoid paper document processing may be the future trend, almost all daily issued invoices are still printed on paper or generated in digital formats such as PDFs. In this paper, we introduce the OCRMiner system for information extraction from scanned document images, which is based on text analysis techniques combined with layout features to extract indexing metadata of (semi-)structured documents. The system is designed to process a document in a way similar to how a human reader does, i.e. to employ different layout and text attributes in a coordinated decision. The system consists of a set of interconnected modules that start from the (possibly erroneous) character-based output of a standard OCR system and make it possible to apply different techniques and to expand the extracted knowledge at each step. Using an open-source OCR, the system is able to recover the invoice data for 90% of the English set and 88% of the Czech set.
•Invoice information extraction is an inevitable task in bulk document processing.•The current best systems are based on less flexible, predefined invoice templates.•OCRMiner uses content and layout processing techniques inspired by human reading.•The system is built and evaluated in a multilingual environment.•The training process uses a very small development set of a few invoices.•OCRMiner reaches accuracy comparable to systems trained on huge curated datasets.
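The abstract's key idea, combining textual cues with layout cues over OCR output, can be illustrated with a toy rule: match keyword patterns for an invoice number anywhere, but accept a "total" amount only in the lower part of the page. This is a deliberately simplistic sketch; the real OCRMiner pipeline is modular and far richer, and the patterns below are invented for illustration.

```python
import re

def extract_invoice_fields(ocr_lines):
    """Toy pass over OCR output, where each item is a (text, y) pair and
    y is the line's relative vertical position on the page (0 = top,
    1 = bottom). Combines a textual cue (regex) with a layout cue."""
    fields = {}
    for text, y in ocr_lines:
        m = re.search(r"(?:Invoice|Faktura)\s*(?:No\.?|č\.)?\s*[:#]?\s*(\S+)",
                      text, re.IGNORECASE)
        if m and "invoice_number" not in fields:
            fields["invoice_number"] = m.group(1)
        m = re.search(r"Total\s*:?\s*([\d.,]+)", text, re.IGNORECASE)
        if m and y > 0.7:                 # layout cue: bottom third of page
            fields["total"] = m.group(1)
    return fields

# Hypothetical OCR output for a one-page invoice.
sample = [("Invoice No: 2021-0042", 0.1), ("Total: 1,250.00", 0.9)]
fields = extract_invoice_fields(sample)
```

Keeping language-specific keyword variants (here English and a Czech-style "Faktura č.") in one rule set mirrors the multilingual setting the highlights mention, though the actual system learns and coordinates such attributes in a more principled way.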
Over the past two decades, we have witnessed an exponential increase in data production worldwide. So-called big data generally come from transactional systems, and even more so from the Internet of Things and social media. They are mainly characterized by volume, velocity, variety, and veracity issues. Big data-related issues strongly challenge traditional data management and analysis systems. The concept of the data lake was introduced to address them. A data lake is a large, raw data repository that stores and manages all company data, whatever its format. However, the data lake concept remains ambiguous or fuzzy for many researchers and practitioners, who often confuse it with the Hadoop technology. We therefore provide in this paper a comprehensive state of the art of the different approaches to data lake design. We particularly focus on data lake architectures and metadata management, which are key issues in successful data lakes. We also discuss the pros and cons of data lakes and their design alternatives.
Making data platforms smarter with MOSES. Francia, Matteo; Gallinucci, Enrico; Golfarelli, Matteo. Future Generation Computer Systems, vol. 125, December 2021. Journal article, peer-reviewed.
The rise of data platforms has enabled the collection and processing of huge volumes of data, but has opened the risk of losing control over them. Collecting proper metadata about raw data and transformations can significantly reduce this risk. In this paper we propose MOSES, a technology-agnostic, extensible, and customizable framework for metadata handling in big data platforms. The framework hinges on a metadata repository that stores information about the objects in the big data platform and the processes that transform them. MOSES provides a wide range of functionalities to different types of users of the platform. Unlike previous high-level proposals, MOSES is fully implemented and was not conceived for a specific technology. Besides discussing the rationale and the features of MOSES, in this paper we describe its implementation and test it on a real case study. The ultimate goal is to take a significant step forward towards proving that metadata handling in big data platforms is feasible and beneficial.
•MOSES is a framework for metadata handling in big data platforms; it provides a wide range of functionalities to different types of users of the platform.•MOSES is fully implemented and has been tested on a real case study.•The goal is to prove that metadata handling in big data platforms is feasible and beneficial.
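The repository the abstract describes, objects plus the processes that transform them, is essentially a lineage graph. The following is a minimal, technology-agnostic sketch in that spirit; the method names and schema are invented here and do not reproduce MOSES's actual design.

```python
class MetadataRepository:
    """Minimal sketch of a metadata repository recording platform
    objects and transformation processes, with a simple lineage query.
    Illustrative only; the real MOSES schema is much richer."""

    def __init__(self):
        self.objects = {}        # object name -> descriptive properties
        self.processes = []      # (process name, input names, output names)

    def register_object(self, name, **properties):
        self.objects[name] = properties

    def register_process(self, name, inputs, outputs):
        self.processes.append((name, list(inputs), list(outputs)))

    def upstream(self, obj):
        """All objects that (transitively) feed into `obj`."""
        result, frontier = set(), {obj}
        while frontier:
            nxt = set()
            for _, ins, outs in self.processes:
                if frontier & set(outs):
                    nxt.update(ins)
            nxt -= result
            result |= nxt
            frontier = nxt
        return result

# Hypothetical pipeline: raw logs are cleansed, then aggregated.
repo = MetadataRepository()
repo.register_object("raw_logs", format="json")
repo.register_object("clean_logs", format="parquet")
repo.register_object("daily_report", format="csv")
repo.register_process("cleanse", ["raw_logs"], ["clean_logs"])
repo.register_process("aggregate", ["clean_logs"], ["daily_report"])
```

Such a lineage query is one example of the kind of functionality a metadata framework can offer different user types: an analyst asks where a report's numbers come from, while a platform engineer asks what breaks if a raw source changes.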
This paper reports on a demonstration of YAMZ (Yet Another Metadata Zoo) as a mechanism for building community consensus around metadata terms. The demonstration is motivated by the complexity of the metadata standards environment and the need for more user-friendly approaches for researchers to achieve vocabulary consensus. The paper reviews a series of metadata standardization challenges, explores crowdsourcing factors that offer possible solutions, and introduces the YAMZ system. A YAMZ demonstration is presented with members of the Toberer materials science laboratory at the Colorado School of Mines, where there is a need to confirm and maintain a shared understanding of the vocabulary supporting research documentation, data management, and the lab's larger metadata infrastructure. The demonstration involves three key steps: 1) sampling terms for the demonstration, 2) engaging graduate student researchers in the demonstration, and 3) reflecting on the demonstration. The results of these steps, including examples of the dialog provenance among lab members and voting, show the ease with which YAMZ can facilitate building metadata vocabulary consensus. The conclusion discusses implications and highlights next steps.
To explore complex biological questions, it is often necessary to access various data types from public data repositories. As the volume and complexity of biological sequence data grow, public repositories face significant challenges in ensuring that the data is easily discoverable and usable by the biological research community. To address these challenges, the National Center for Biotechnology Information (NCBI) has created NCBI Datasets. This resource provides straightforward, comprehensive, and scalable access to biological sequences, annotations, and metadata for a wide range of taxa. Following the FAIR (Findable, Accessible, Interoperable, and Reusable) data management principles, NCBI Datasets offers user-friendly web interfaces, command-line tools, and documented APIs, empowering researchers to access NCBI data seamlessly. The data is delivered as packages of sequences and metadata, thus facilitating improved data retrieval, sharing, and usability in research. Moreover, this data delivery method fosters effective data attribution and promotes its further reuse. This paper outlines the current scope of data accessible through NCBI Datasets and explains various options for exploring and downloading the data.
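Because the packages pair sequence files with machine-readable metadata, downstream scripts can summarize a download without touching the sequences themselves. The sketch below parses a JSON-Lines data report of the general kind shipped in such packages; note that the field names used here ("accession", "organism"/"organismName") and the two sample records are illustrative assumptions, not verbatim NCBI output.

```python
import json

def summarize_data_report(jsonl_text):
    """Parse a JSON-Lines metadata report (one JSON record per line)
    and map each accession to its organism name. Field names follow
    the general shape of such reports but are assumptions here."""
    records = [json.loads(line)
               for line in jsonl_text.splitlines() if line.strip()]
    return {r["accession"]: r.get("organism", {}).get("organismName")
            for r in records}

# Illustrative two-record report, not real NCBI output.
sample = (
    '{"accession": "GCF_000001405.40", "organism": {"organismName": "Homo sapiens"}}\n'
    '{"accession": "GCF_000001635.27", "organism": {"organismName": "Mus musculus"}}\n'
)
summary = summarize_data_report(sample)
```

Shipping metadata in a line-delimited, self-describing format like this is what makes the packages easy to post-process and share, which is the usability point the abstract emphasizes.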