The literature provides a wide range of techniques to assess and improve the quality of data. Due to the diversity and complexity of these techniques, research has recently focused on defining methodologies that help the selection, customization, and application of data quality assessment and improvement techniques. The goal of this article is to provide a systematic and comparative description of such methodologies. Methodologies are compared along several dimensions, including the methodological phases and steps, the strategies and techniques, the data quality dimensions, the types of data, and, finally, the types of information systems addressed by each methodology. The article concludes with a summary description of each methodology.
Social media have the potential to provide timely information about emergency situations and sudden events. However, finding relevant information among the millions of posts added every day can be difficult, and with current approaches, developing an automatic data analysis project requires time and technical skills. This work presents a new approach for the analysis of social media posts, based on configurable automatic classification combined with Citizen Science methodologies. The process is facilitated by a set of flexible, automatic, and open-source data processing tools called the Citizen Science Solution Kit. The kit provides a comprehensive set of tools that can be used and personalized in different situations, particularly during natural emergencies, starting from the images and text contained in the posts. The tools can be employed by citizen scientists for filtering, classifying, and geolocating content with a human-in-the-loop approach that supports the data analyst, including feedback and suggestions on how to configure the automated tools, and techniques to gather input from citizens. Using a flooding scenario as a guiding example, this paper illustrates the structure and functioning of the different tools proposed to support citizen scientists in their projects, and a methodological approach to their use. The process is then validated by discussing three case studies based on the Albania earthquake of 2019, the Covid-19 pandemic, and the Thailand floods of 2021. The results suggest that a flexible approach to tool composition and configuration can support the timely setup of an analysis project by citizen scientists, especially in the case of emergencies in unexpected locations.
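As an illustration only, the following Python sketch mimics the filter-classify-geolocate flow described above, with a review queue feeding the human-in-the-loop step; all names, the pluggable `model` callable, and the confidence threshold are hypothetical and do not reproduce the actual Citizen Science Solution Kit API.

```python
# Hypothetical sketch of a configurable post-analysis pipeline; illustrative only.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Post:
    text: str
    label: Optional[str] = None
    confidence: float = 0.0
    location: Optional[str] = None

def keyword_filter(posts, keywords):
    """Keep only posts mentioning at least one configured keyword."""
    return [p for p in posts if any(k in p.text.lower() for k in keywords)]

def classify(post, model):
    """Apply a pluggable classifier; `model` is any callable returning (label, confidence)."""
    post.label, post.confidence = model(post.text)
    return post

def geolocate(post, gazetteer):
    """Naive gazetteer lookup: match known place names appearing in the text."""
    for place in gazetteer:
        if place.lower() in post.text.lower():
            post.location = place
    return post

def run_pipeline(posts, keywords, model, gazetteer, review_threshold=0.7):
    """Filter, classify, and geolocate; route low-confidence posts to citizen review."""
    results, review_queue = [], []
    for post in keyword_filter(posts, keywords):
        post = geolocate(classify(post, model), gazetteer)
        (results if post.confidence >= review_threshold else review_queue).append(post)
    return results, review_queue  # review_queue feeds the human-in-the-loop step
```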
Smart urban transportation management can be considered a multifaceted big data challenge. It strongly relies on information collected from multiple, widespread, and heterogeneous data sources, as well as on the ability to extract actionable insights from them. Besides data, full-stack (from platform to services and applications) Information and Communications Technology (ICT) solutions need to be specifically adopted to address smart city challenges. Smart urban transportation management is one of the key use cases addressed in the context of the EUBra-BIGSEA (Europe-Brazil Collaboration of Big Data Scientific Research through Cloud-Centric Applications) project. This paper specifically focuses on the City Administration Dashboard, a public transport analytics application that has been developed on top of the EUBra-BIGSEA platform and used by the municipality stakeholders of Curitiba, Brazil, to tackle urban traffic data analysis and planning challenges. The solution proposed in this paper brings together a scalable big and fast data analytics platform, a flexible and dynamic cloud infrastructure, data quality and entity matching algorithms, as well as security and privacy techniques. By exploiting an interoperable programming framework based on a Python Application Programming Interface (API), it allows easy, rapid, and transparent development of smart city applications.
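The platform's actual Python API is not reproduced here; the self-contained sketch below only illustrates the kind of per-line aggregation behind a typical public transport dashboard indicator, with assumed record fields.

```python
# Illustrative only: assumed record structure, not the EUBra-BIGSEA API.
from dataclasses import dataclass

@dataclass
class BusRecord:
    line: str             # bus line identifier
    delay_minutes: float  # delay derived from GPS traces vs. the timetable

def average_delay_by_line(records):
    """Aggregate per-line average delays, a typical dashboard indicator."""
    sums = {}
    for r in records:
        total, count = sums.get(r.line, (0.0, 0))
        sums[r.line] = (total + r.delay_minutes, count + 1)
    return {line: total / count for line, (total, count) in sums.items()}

records = [BusRecord("X12", 3.0), BusRecord("X12", 5.0), BusRecord("B07", 1.0)]
print(average_delay_by_line(records))  # -> {'X12': 4.0, 'B07': 1.0}
```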
Crowdsourcing enables one to leverage the intelligence and wisdom of potentially large groups of individuals toward solving problems. Common problems approached with crowdsourcing are labeling images, translating or transcribing text, providing opinions or ideas, and similar tasks at which computers are not good or may even fail altogether. The introduction of humans into computations and/or everyday work, however, also poses critical, novel challenges in terms of quality control, as the crowd is typically composed of people with unknown and very diverse abilities, skills, interests, personal objectives, and technological resources. This survey studies quality in the context of crowdsourcing along several dimensions, so as to define and characterize it and to understand the current state of the art. Specifically, this survey derives a quality model for crowdsourcing tasks, identifies the methods and techniques that can be used to assess the attributes of the model, as well as the actions and strategies that help prevent and mitigate quality problems. An analysis of how these features are supported by the state of the art further identifies open issues and informs an outlook on promising future research directions.
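As a concrete illustration of one canonical quality-control action within the survey's scope, the sketch below aggregates redundant worker labels by majority vote and uses inter-worker agreement as a rough quality signal; the data and the follow-up policy are illustrative.

```python
# Minimal sketch of majority-vote aggregation over redundant crowd labels.
from collections import Counter

def majority_vote(labels):
    """Return the most frequent label and the fraction of workers agreeing with it."""
    counts = Counter(labels)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(labels)

# Example: three workers label the same image.
label, agreement = majority_vote(["cat", "cat", "dog"])
print(label, round(agreement, 2))  # -> cat 0.67; low agreement can trigger extra assignments
```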
Pervasive sensing is increasing our ability to monitor the status of patients not only when they are hospitalized but also during home recovery. As a result, large amounts of data are collected and made available for multiple purposes. While operational processes can take advantage of timely and detailed data, the huge amount of collected data can also be useful for analytics. However, these data may be unusable for two reasons: data quality and performance problems. First, if the quality of the collected values is low, the processing activities could produce unreliable results. Second, if the system does not guarantee adequate performance, the results may not be delivered at the right time. The goal of this article is to propose a data utility model that considers the impact of the quality of the data sources (e.g., collected data, biographical data, and clinical history) on the expected results and enables performance improvements through utility-driven data management in a Fog environment. Regarding data quality, our approach treats it as a context-dependent problem: a given dataset can be considered useful for one application and inadequate for another. For this reason, we suggest a context-dependent quality assessment considering dimensions such as accuracy, completeness, consistency, and timeliness, and we argue that different applications have different quality requirements. The management of data in Fog computing also requires particular attention to quality of service (QoS) requirements. For this reason, we include QoS aspects in the data utility model, such as availability, response time, and latency. Based on the proposed data utility model, we present an approach based on a goal model capable of identifying when one or more quality of service or data quality dimensions are violated and of suggesting the best action to take to address the violation. The proposed approach is evaluated with a real, appropriately anonymized dataset obtained as part of the experimental procedure of a research project in which a device with a set of sensors (inertial, temperature, humidity, and light) was used to collect motion and environmental data associated with the daily physical activities of healthy young volunteers.
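A minimal sketch of such context-dependent checks, under assumed metric definitions: each application declares its own thresholds over the measured dimensions, and any violated dimension would trigger a corrective action in the goal model.

```python
# Illustrative metric definitions; real assessments would be richer.
from time import time

def completeness(records, fields):
    """Fraction of non-missing values over the expected fields."""
    total = len(records) * len(fields)
    present = sum(1 for r in records for f in fields if r.get(f) is not None)
    return present / total if total else 0.0

def timeliness(last_update_ts, max_age_s):
    """1 for fresh data, decaying linearly with age (one common model)."""
    return max(0.0, 1.0 - (time() - last_update_ts) / max_age_s)

def violated_dimensions(metrics, requirements):
    """Compare measured dimensions against an application's own requirements."""
    return [d for d, threshold in requirements.items() if metrics[d] < threshold]

# The same dataset can satisfy a monitoring app and be inadequate for analytics.
metrics = {"completeness": 0.85, "timeliness": 0.9}
monitoring = {"completeness": 0.8, "timeliness": 0.9}
analytics = {"completeness": 0.95, "timeliness": 0.5}
print(violated_dimensions(metrics, monitoring))  # -> []
print(violated_dimensions(metrics, analytics))   # -> ['completeness']
```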
Big data has changed the way we collect and analyze data. In particular, the amount of available information is constantly growing, and organizations rely more and more on data analysis to achieve their competitive advantage. However, such an amount of data creates real value only if combined with quality: good decisions and actions are the result of correct, reliable, and complete data. In such a scenario, methods and techniques for Data Quality assessment can support the identification of suitable data to process. While numerous assessment methods have been proposed for traditional databases, new algorithms have to be designed in the Big Data scenario to deal with novel requirements related to variety, volume, and velocity. In particular, in this paper we highlight that dealing with heterogeneous sources requires an adaptive approach able to trigger the suitable quality assessment methods on the basis of the data type and the context in which the data are to be used. Furthermore, we show that in some situations it is not possible to evaluate the quality of the entire dataset due to performance and time constraints. For this reason, we suggest focusing the Data Quality assessment on only a portion of the dataset and accounting for the consequent loss of accuracy by introducing a confidence factor as a measure of the reliability of the quality assessment procedure. We propose a methodology to build a Data Quality adapter module, which selects the best configuration for the Data Quality assessment based on the user's main requirements: time minimization, confidence maximization, and budget minimization. Experiments are performed on real data gathered from a smart city case study.
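The sketch below illustrates the idea with invented formulas: a dimension is measured on a random sample, a confidence factor grows with the sampled fraction, and the adapter picks the fastest configuration that still meets a confidence goal.

```python
# Illustrative sampling-based assessment with a confidence factor.
import random

def sample_completeness(dataset, field, fraction):
    """Estimate completeness of `field` on a random sample of the dataset."""
    sample = random.sample(dataset, max(1, int(len(dataset) * fraction)))
    return sum(1 for row in sample if row.get(field) is not None) / len(sample)

def confidence_factor(fraction):
    """Invented monotone mapping: a full scan yields confidence 1, small samples less."""
    return min(1.0, fraction ** 0.5)

def pick_configuration(configs, min_confidence):
    """Among (fraction, est_seconds) options, pick the fastest meeting the confidence goal."""
    feasible = [c for c in configs if confidence_factor(c[0]) >= min_confidence]
    return min(feasible, key=lambda c: c[1]) if feasible else None

# Example: choose between scanning 10%, 50%, or 100% of the data.
configs = [(0.1, 2.0), (0.5, 9.0), (1.0, 20.0)]
print(pick_configuration(configs, min_confidence=0.7))  # -> (0.5, 9.0)
```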
• Data Quality assessment is a key success factor for applications using big data.
• Data Quality assessment in big data requires approximation.
• Confidence gives hints on data quality without a complete analysis.
• Confidence is sensitive to the data source and the DQ metrics considered.
• Optimization is used to select the best configuration for assessing DQ.
Bots are algorithmically driven entities that act like humans in conversations on Twitter or Facebook, in chats, or on Q&A sites. This paper studies how they may affect online conversations, provides a taxonomy of the harms they may cause, and discusses how to prevent harm by studying when abuses occur.
Data play a key role in AI systems that support decision-making processes. Data-centric AI highlights the importance of having high-quality input data to obtain reliable results. However, preparing data well for machine learning is becoming difficult due to the variety of data quality issues and of available data preparation tasks. For this reason, approaches that help users perform this demanding phase are needed. This work proposes DIANA, a framework for data-centric AI that supports data exploration and preparation, suggesting suitable cleaning tasks to obtain valuable analysis results. We design an adaptive self-service environment that can handle the analysis and preparation of different types of sources, i.e., tabular and streaming data. The central component of our framework is a knowledge base that collects evidence related to the effectiveness of data preparation actions along with the type of input data and the considered machine learning model. In this paper, we first describe the framework, the knowledge base model, and its enrichment process. Then, we show the experiments conducted to enrich the knowledge base in a particular case study: time series data streams.
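The sketch below illustrates, with invented structures, the kind of evidence store such a knowledge base implies: outcomes of preparation actions are recorded per (data type, quality issue, model) combination and ranked by mean observed effectiveness at suggestion time.

```python
# Hypothetical evidence store; not the actual DIANA knowledge base schema.
from collections import defaultdict

class KnowledgeBase:
    def __init__(self):
        # (data_type, issue, model) -> {action: [observed effectiveness scores]}
        self.evidence = defaultdict(lambda: defaultdict(list))

    def record(self, data_type, issue, model, action, score):
        """Enrich the KB with the outcome of applying `action` in one experiment."""
        self.evidence[(data_type, issue, model)][action].append(score)

    def suggest(self, data_type, issue, model):
        """Rank candidate cleaning actions by mean observed effectiveness."""
        actions = self.evidence.get((data_type, issue, model), {})
        return sorted(actions, key=lambda a: sum(actions[a]) / len(actions[a]), reverse=True)

kb = KnowledgeBase()
kb.record("time_series", "missing_values", "lstm", "linear_interpolation", 0.92)
kb.record("time_series", "missing_values", "lstm", "mean_imputation", 0.81)
print(kb.suggest("time_series", "missing_values", "lstm"))
# -> ['linear_interpolation', 'mean_imputation']
```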
Current smart contract-enabled blockchain technology exhibits limited support for data quality assessment of transaction payloads. This is critical because blockchain aims to remove intermediaries, which often play an important role in guaranteeing a certain level of quality of the data used by a system. Moreover, owing to the immutability typical of blockchain, poor-quality data are bound to remain stored in a blockchain, possibly forever. This article contextualizes the issue of data quality in blockchains, discussing how to extend or adapt blockchain technology to support data quality assessment and identifying a set of challenges for future research.
In blockchain-oriented software engineering, the quality of input data is often overlooked. Controlling transaction payloads in the transaction validation phase may prevent poor-data-quality issues that can affect the output of even correct smart contracts.
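As an illustration of the idea rather than of any specific blockchain client, the sketch below checks a transaction payload for completeness and plausibility during validation; the field names and rules are assumptions.

```python
# Hypothetical payload quality check in the transaction validation phase.
def validate_payload(payload):
    """Return the list of data quality violations; an empty list means accept."""
    violations = []
    for f in ("sensor_id", "timestamp", "temperature"):  # completeness check
        if payload.get(f) is None:
            violations.append(f"missing field: {f}")
    t = payload.get("temperature")
    if t is not None and not -50 <= t <= 60:  # plausibility (accuracy) check
        violations.append(f"temperature out of plausible range: {t}")
    return violations

# A validating node would reject or flag the transaction on violations:
print(validate_payload({"sensor_id": "s1", "timestamp": 1700000000}))
# -> ['missing field: temperature']
```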