Many massive data processing applications now require long, continuous, and uninterrupted data access. Distributed file systems are used as the back-end storage to provide global namespace management and reliability guarantees. With hardware failures and software issues increasing as systems scale, metadata service reliability has become a critical issue, as it directly impacts file and directory operations. Existing metadata management mechanisms provide some degree of fault tolerance but are inadequate: they are often limited in system availability, state consistency, and performance overhead, and they lack an effective mechanism for metadata reliability. This paper introduces a novel highly reliable metadata service to address these issues in large-scale file systems. Unlike traditional strategies, the proposed service adopts a new active-standby architecture for fault tolerance and takes a holistic approach to improving file system availability. A new shared storage pool (SSP) is designed for transparent metadata synchronization and replication between active and standby servers. Based on the SSP, a new policy called multiple actives multiple standbys (MAMS) performs metadata service recovery in case of failures. A new global state recovery strategy and a smart client fault tolerance mechanism maintain the continuity of the metadata service. We have implemented this highly reliable metadata service in a prototype file system, CFS (Clover file system), and conducted extensive tests to evaluate it. Experimental results confirm that it significantly improves file system reliability with fast failover under different failure scenarios while having negligible influence on performance.
Compared with the typical reliability designs in the Hadoop Avatar, Hadoop HA, and Boom-FS file systems, the mean time to recovery (MTTR) with the highly reliable metadata service was reduced by 80.23, 65.46, and 28.13 percent, respectively.
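The active-standby recovery idea behind MAMS can be illustrated with a minimal failover sketch: an active metadata server emits heartbeats, and when they go stale, a live standby promotes itself. All names, the timeout value, and the promotion logic here are illustrative assumptions, not CFS's actual interfaces.

```python
import time

HEARTBEAT_TIMEOUT = 3.0  # assumed: seconds without a heartbeat before failover

class MetadataServer:
    def __init__(self, name, role):
        self.name = name
        self.role = role                      # "active" or "standby"
        self.last_heartbeat = time.monotonic()

    def heartbeat(self):
        self.last_heartbeat = time.monotonic()

    def is_alive(self, now):
        return now - self.last_heartbeat < HEARTBEAT_TIMEOUT

def failover(servers, now):
    """Promote the first live standby if the active server looks dead."""
    active = next(s for s in servers if s.role == "active")
    if active.is_alive(now):
        return active
    active.role = "failed"
    for s in servers:
        if s.role == "standby" and s.is_alive(now):
            # in the paper's design, the standby's state would already be
            # synchronized through the shared storage pool (SSP)
            s.role = "active"
            return s
    raise RuntimeError("no live standby available")
```

Because state is replicated transparently through the SSP, the promoted standby does not need to replay logs from the failed active, which is the intuition behind the short MTTR reported above.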
This paper uses visual network analysis (VNA) for an exploratory data analysis of instapoetry, focusing on the use and co-occurrence of hashtags connected to Scandinavian instapoetry. The goal was to reveal and explore some of the networked patterns and processes connected to the production and distribution of instapoetry using digital methods. Through descriptive measurements of instapoetry metadata and a visual network analysis, this paper identifies characteristics of such patterns. Findings reveal that the Scandinavian instapoetry community is small and Norwegian-dominated, with an established use of semantically close words related to poetry serving as tags to organize the poetry and make it findable. In addition, the hashtags reveal larger popular themes and topics; recurring themes are emotions, interpersonal relations, and mental health. While at one level these tags say something about the content of the poems, some of them bring instapoetry into other communities and interest spheres on Instagram, with prominent examples being interest spaces around specific mental illnesses but also, by way of one high-visibility instapoet, the interest sphere of nature photography and Norwegian tourism promotion.
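The co-occurrence network underlying this kind of analysis can be sketched simply: hashtags are nodes, and an edge's weight counts how often two tags appear on the same post. The posts below are invented examples for illustration, not data from the study.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_edges(posts):
    """posts: list of hashtag lists; returns a Counter of weighted edges."""
    edges = Counter()
    for tags in posts:
        # sorted() gives a canonical node order so (a, b) == (b, a)
        for a, b in combinations(sorted(set(tags)), 2):
            edges[(a, b)] += 1
    return edges

# invented example posts, each a list of hashtags
posts = [
    ["poesi", "dikt", "instapoesi"],
    ["dikt", "instapoesi", "mentalhealth"],
    ["poesi", "instapoesi"],
]
edges = cooccurrence_edges(posts)
```

A weighted edge list like this is exactly what VNA tools take as input; heavily weighted edges between poetry tags and, say, mental-health tags are the bridges into other interest spheres that the findings describe.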
While storing invoice content as metadata to avoid paper document processing may be the future trend, almost all daily issued invoices are still printed on paper or generated in digital formats such as PDFs. In this paper, we introduce the OCRMiner system for information extraction from scanned document images, which is based on text analysis techniques combined with layout features to extract indexing metadata of (semi-)structured documents. The system is designed to process a document in a similar way to a human reader, i.e., to employ different layout and text attributes in a coordinated decision. It consists of a set of interconnected modules that start with (possibly erroneous) character-based output from a standard OCR system and allow different techniques to be applied and the extracted knowledge to be expanded at each step. Using an open-source OCR, the system is able to recover the invoice data in 90% of cases for the English set and 88% for the Czech set.
•Invoice information extraction is an inevitable task in bulk document processing.
•The current best systems are based on less flexible predefined invoice templates.
•OCRMiner uses content and layout processing techniques inspired by human reading.
•The system is built and evaluated in a multilingual environment.
•The training process uses a very small development set of a few invoices.
•OCRMiner reaches accuracy comparable to systems trained on huge curated datasets.
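The coordinated use of text and layout cues described above can be sketched on a toy example: each OCR token carries a position, and a field such as the invoice number is found by pairing a keyword with the nearest plausible value to its right on roughly the same baseline. The keyword list, tolerances, and value pattern are assumptions for illustration, not OCRMiner's actual module interfaces.

```python
import re

# hypothetical keyword list (English/Czech) for one field type
KEYWORDS = {"invoice", "invoice no.", "faktura"}

def extract_invoice_number(tokens):
    """tokens: list of (text, x, y) tuples from an OCR pass."""
    for text, x, y in tokens:
        if text.lower().rstrip(":") not in KEYWORDS:
            continue
        # layout cue: value tokens sit to the right, on roughly the same line
        candidates = [
            (tx, t) for (t, tx, ty) in tokens
            if tx > x and abs(ty - y) < 5 and re.fullmatch(r"[A-Z0-9-]+", t)
        ]
        if candidates:
            return min(candidates)[1]  # token closest to the keyword wins
    return None
```

Even this tiny rule shows why (possibly erroneous) OCR output is workable as a starting point: the layout constraint filters out most spurious text matches before any value is accepted.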
Over the past two decades, we have witnessed an exponential increase of data production in the world. So-called big data generally come from transactional systems, and even more so from the Internet of Things and social media. They are mainly characterized by volume, velocity, variety and veracity issues. Big data-related issues strongly challenge traditional data management and analysis systems. The concept of data lake was introduced to address them. A data lake is a large, raw data repository that stores and manages all company data bearing any format. However, the data lake concept remains ambiguous or fuzzy for many researchers and practitioners, who often confuse it with the Hadoop technology. Thus, we provide in this paper a comprehensive state of the art of the different approaches to data lake design. We particularly focus on data lake architectures and metadata management, which are key issues in successful data lakes. We also discuss the pros and cons of data lakes and their design alternatives.
This paper reports on a demonstration of YAMZ (Yet Another Metadata Zoo) as a mechanism for building community consensus around metadata terms. The demonstration is motivated by the complexity of the metadata standards environment and the need for more user-friendly approaches for researchers to achieve vocabulary consensus. The paper reviews a series of metadata standardization challenges, explores crowdsourcing factors that offer possible solutions, and introduces the YAMZ system. A YAMZ demonstration is presented with members of the Toberer materials science laboratory at the Colorado School of Mines, where there is a need to confirm and maintain a shared understanding of the vocabulary supporting research documentation, data management, and their larger metadata infrastructure. The demonstration involves three key steps: 1) sampling terms for the demonstration, 2) engaging graduate student researchers in the demonstration, and 3) reflecting on the demonstration. The results of these steps, including examples of the dialog provenance among lab members and voting, show the ease with which YAMZ can facilitate building metadata vocabulary consensus. The conclusion discusses implications and highlights next steps.
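The voting mechanism at the heart of this kind of crowdsourced consensus can be sketched as a simple approval-ratio rule: a proposed term gains a "stable" status once enough of the community approves it. The threshold and status names below are illustrative assumptions, not YAMZ's actual rules.

```python
STABILITY_THRESHOLD = 0.75  # assumed approval ratio for consensus

def term_status(up_votes, down_votes):
    """Classify a proposed metadata term by its community vote tally."""
    total = up_votes + down_votes
    if total == 0:
        return "proposed"      # no community signal yet
    ratio = up_votes / total
    return "stable" if ratio >= STABILITY_THRESHOLD else "contested"
```

A threshold rule like this keeps the barrier to proposing a term low while still requiring broad agreement, which is the crowdsourcing trade-off the paper explores.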
•Comprehensive investigation of metadata complexities in online portals and data repositories.
•Automating layout preparation to streamline metadata files for accurate element detection.
•Improving metadata structural quality through the use of syntactic preparators.
•Elevating the contextual quality of metadata by applying semantic preparators.
•Conducting performance evaluations of the proposed methodologies to gain valuable insights.
With the exponential growth of data production, the generation of metadata has become an integral part of the process. Metadata plays a crucial role in facilitating enhanced data analytics, data integration, and resource management by offering valuable insights. However, inconsistencies arise due to deviations from standards in metadata recording, including missing attribute information, publishing URLs, and provenance. Furthermore, the recorded metadata may exhibit inconsistencies, such as varied value formats, special characters, and inaccurately entered values. Addressing these inconsistencies through metadata preparation can greatly enhance the user experience during data management tasks.
This paper introduces MDPrep, a system that explores the usability and applicability of data preparation techniques in improving metadata quality. Our approach involves three steps: (1) detecting and identifying problematic metadata elements and structural issues, (2) employing a keyword-based approach to enhance metadata elements and a syntax-based approach to rectify structural metadata issues, and (3) comparing the outcomes to ensure improved readability and reusability of prepared metadata files.
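The three-step flow above can be sketched on a toy metadata record: detect problematic elements, repair them with simple syntax-based rules, and compare the record before and after. The field names, required-element list, and date-normalization rule are invented for illustration; MDPrep's actual detectors and preparators are more elaborate.

```python
import re

REQUIRED = ["title", "url", "date"]  # assumed required metadata elements

def detect_issues(record):
    """Step 1: flag missing elements and badly formatted values."""
    issues = []
    for field in REQUIRED:
        value = record.get(field, "").strip()
        if not value:
            issues.append((field, "missing"))
        elif field == "date" and not re.fullmatch(r"\d{4}-\d{2}-\d{2}", value):
            issues.append((field, "bad-format"))
    return issues

def prepare(record):
    """Step 2: apply a syntax-based preparator to repairable issues."""
    fixed = dict(record)
    for field, kind in detect_issues(record):
        if field == "date" and kind == "bad-format":
            # normalize DD/MM/YYYY to ISO 8601
            m = re.fullmatch(r"(\d{2})/(\d{2})/(\d{4})", fixed[field].strip())
            if m:
                fixed[field] = f"{m.group(3)}-{m.group(2)}-{m.group(1)}"
    return fixed
```

Step 3 is then a comparison: running the detector again on the prepared record and checking that the issue list shrank is the simplest measure of improved reusability.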
This paper takes a technical services perspective on user experience (UX) research into student searching behaviors. In this observational study, students were free to search as they normally would while conducting research for an upcoming essay or assignment. Researchers took careful note of the search process, including how searches were composed and which metadata fields students looked at in their results lists. The findings of the study, and how local technical services staff responded to them, are discussed in this paper. The project was a useful way to prioritize the work of technical services based on insights from user searching behavior and to help ensure library resources are discoverable in the most effective manner.