Good data quality is crucial for the effective and safe operation of any data-driven system. For safety-critical systems, data quality matters even more, since incorrect or low-quality data may cause fatal faults. However, identifying and managing data quality remains challenging. In particular, there is no accepted process for defining and continuously testing data quality against what is necessary for operating the system. This gap is problematic because even safety-critical systems are becoming increasingly dependent on data. Here, we propose a Candidate Framework for Data Quality Assessment and Maintenance (CaFDaQAM) to systematically manage data quality and related requirements, developed using design science research. The framework is constructed from an advanced driver assistance system (ADAS) case study and draws on empirical data from a literature review, focus groups, and design workshops. The proposed framework consists of four components: a Data Quality Workflow, a List of Data Quality Challenges, a List of Data Quality Attributes, and Solution Candidates. Together, the components act as tools for data quality assessment and maintenance. The candidate framework and its components were validated in a focus group.
Data quality evaluation is built upon data quality measurement results. “Data quality evaluation” uses the “data quality rules” representing the organization's risk appetite to decide on the usability of the data; “data quality measurement” uses the business rules describing the “data requirements” or “data specifications” to determine the validity of the data. Consequently, to conduct meaningful and useful data quality evaluations, business rules must first be completely identified and captured at the beginning of the evaluation so that sound measurements can be performed. We propose that the evaluation yields better, more interpretable, and more useful results when the potential contribution of these business rules to the measurement of the data quality characteristics is assessed first, so that rules with no potential contribution are excluded from the evaluation and the resulting waste of resources is avoided. For better management of business rules in data quality evaluation, it therefore makes sense to group all business rules that contribute meaningfully to the evaluation of data quality characteristics, something that other business rule management methodologies have not yet covered. Through our experience conducting industrial data quality evaluation projects, we identified six problems in collecting and grouping business rules; these problems make data quality evaluation processes less efficient and more costly. The main contribution of this paper is a methodology to systematically collect, group, and validate business rules to avoid or alleviate these problems. For the sake of generalization, comparability, and reusability, we propose grouping by the data quality characteristics and properties defined in ISO/IEC 25012 and ISO/IEC 25024, respectively. Lastly, we validate the methodology in three case studies from real projects. From this validation, we conclude that the methodology is useful, applicable in the real world, and valid for capturing and grouping the business rules used as a basis for data quality evaluation.
• Data quality measurement requires business rules describing the validity of data.
• Data quality evaluation is performed upon data quality measurement results.
• Grouping business rules can optimize the process of data quality measurement.
• Grouping business rules is done according to selected data quality characteristics.
• Grouping business rules helps to drive and optimize data quality improvement.
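As a concrete illustration of the grouping idea (this is not the authors' methodology; the `BusinessRule` structure, rules, and records below are invented for the example), business rules can be grouped under the ISO/IEC 25012 characteristic they contribute to and then used as simple measurements:

```python
# Minimal sketch: grouping business rules by ISO/IEC 25012 characteristic
# and using them as measurements. All rules and records are invented.
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable

@dataclass
class BusinessRule:
    name: str
    characteristic: str            # e.g. "completeness" (ISO/IEC 25012)
    check: Callable[[dict], bool]  # True if the record satisfies the rule

rules = [
    BusinessRule("age_present", "completeness",
                 lambda r: r.get("age") is not None),
    BusinessRule("age_in_range", "accuracy",
                 lambda r: r.get("age") is not None and 0 <= r["age"] <= 120),
    BusinessRule("end_after_start", "consistency",
                 lambda r: r.get("end") is None or r.get("start") is None
                 or r["end"] >= r["start"]),
]

records = [
    {"age": 34,   "start": 1, "end": 5},
    {"age": None, "start": 2, "end": 1},
    {"age": 250,  "start": 0, "end": 3},
]

# Group rules by the characteristic they contribute to; in practice,
# rules with no potential contribution would be excluded up front.
grouped = defaultdict(list)
for rule in rules:
    grouped[rule.characteristic].append(rule)

# Measurement: fraction of records passing each characteristic's rules.
for characteristic, rule_group in grouped.items():
    passed = sum(all(rule.check(r) for rule in rule_group) for r in records)
    print(f"{characteristic}: {passed}/{len(records)} records valid")
```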
The literature provides a wide range of techniques to assess and improve the quality of data. Due to the diversity and complexity of these techniques, research has recently focused on defining methodologies that help the selection, customization, and application of data quality assessment and improvement techniques. The goal of this article is to provide a systematic and comparative description of such methodologies. Methodologies are compared along several dimensions, including the methodological phases and steps, the strategies and techniques, the data quality dimensions, the types of data, and, finally, the types of information systems addressed by each methodology. The article concludes with a summary description of each methodology.
DMN4DQ: When data quality meets DMN
Valencia-Parra, Álvaro; Parody, Luisa; Varela-Vaca, Ángel Jesús; et al.
Decision Support Systems, Volume 141, February 2021. Journal article, peer reviewed, open access.
To succeed in their business processes, organizations need data that not only attains suitable levels of quality for the task at hand, but that can also be considered usable for the business. However, many researchers ground the potential usability of the data on its quality. Organizations would benefit from receiving recommendations on the usability of the data before its use. We propose that the recommendation on the usability of the data be supported by a decision process that includes a context-dependent data quality assessment based on business rules. Ideally, this recommendation would be generated automatically. Decision Model and Notation (DMN) enables the assessment of data quality based on the evaluation of business rules, and also provides stakeholders (e.g., data stewards) with sound support for automating the whole process of generating a recommendation on usability based on data quality.
The main contribution of the proposal involves designing and enabling both DMN-driven mechanisms and a guiding methodology (DMN4DQ) to support the automatic generation of a decision-based recommendation on the potential usability of a data record in terms of its level of data quality. Furthermore, the proposal is validated through application to a real dataset.
• Companies need a high level of data quality for many contexts of use.
• DMN4DQ is a context-aware systematic method for assessing data quality.
• DMN4DQ relies on DMN to automatically decide about the usability of data.
• Organizations formalise data quality rules according to their risk appetite.
• DMN facilitates the declarative formalisation of data quality business rules.
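To make the decision-table idea concrete, here is a minimal sketch, not the DMN4DQ implementation itself: the thresholds, inputs, and outputs are invented, and DMN's first-hit behavior is emulated in plain Python rather than in a DMN engine.

```python
# Sketch of a DMN-style decision table: measured quality levels in,
# usability recommendation out. Thresholds and outputs are invented.

# Each row: (condition on completeness, condition on accuracy, output).
# A condition returning True for any value mirrors DMN's "-" (irrelevant input).
decision_table = [
    (lambda c: c >= 0.95, lambda a: a >= 0.95, "usable"),
    (lambda c: c >= 0.80, lambda a: a >= 0.80, "usable with caution"),
    (lambda c: True,      lambda a: True,      "not usable"),
]

def decide(completeness: float, accuracy: float) -> str:
    """First matching rule wins, as in DMN's first hit policy ('F')."""
    for cond_c, cond_a, output in decision_table:
        if cond_c(completeness) and cond_a(accuracy):
            return output
    return "not usable"

print(decide(0.99, 0.97))  # -> usable
print(decide(0.90, 0.85))  # -> usable with caution
print(decide(0.50, 0.99))  # -> not usable
```

In an actual DMN setting, the same table would be a declarative artifact that data stewards maintain, with the engine performing the evaluation automatically.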
Nowadays, IoT is being used in more and more application areas, and the importance of IoT data quality is widely recognized by practitioners and researchers. The requirements for data and its quality vary from application to application and from organization to organization, depending on context. Many methodologies and frameworks include techniques for defining, assessing, and improving data quality. However, due to the diversity of requirements, choosing the appropriate technique for an IoT system can be a challenge. This paper surveys data quality frameworks and methodologies for IoT data, as well as related international standards, comparing them in terms of data types, data quality definitions, dimensions and metrics, and the choice of assessment dimensions. The survey is intended to help narrow down the possible choices of IoT data quality management techniques.
High-quality data is key to interpretable and trustworthy data analytics and the basis for meaningful data-driven decisions. In practical scenarios, data quality is typically associated with data preprocessing, profiling, and cleansing for subsequent tasks like data integration or data analytics. However, from a scientific perspective, a lot of research has been published about the measurement (i.e., the detection) of data quality issues, and different generally applicable data quality dimensions and metrics have been discussed. In this work, we close the gap between data quality research and practical implementations with a detailed investigation of data quality tools. For the first time, and in contrast to all existing data quality tool surveys, we conducted a systematic search in which we identified 667 software tools dedicated to "data quality." To evaluate the tools, we compiled a requirements catalog with three functionality areas: (1) data profiling, (2) data quality measurement in terms of metrics, and (3) automated data quality monitoring. Using a set of predefined exclusion criteria, we selected 13 tools (8 commercial and 5 open-source) that provide the investigated features and are not limited to a specific domain for detailed investigation. On the one hand, this survey allows a critical discussion of concepts that are widely accepted in research but hardly implemented in any tool observed, for example, generally applicable data quality metrics. On the other hand, it reveals potential for functional enhancement of data quality tools and supports practitioners in selecting appropriate tools for a given use case.
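As an example of the kind of generally applicable data quality metric the survey looks for in tools, the following sketch (invented data, a deliberately simple metric definition, and an assumed monitoring threshold) computes column-level completeness and flags columns that fall below the threshold:

```python
# Sketch of one generally applicable metric: column completeness,
# defined here simply as the share of non-null values. Data is invented.
rows = [
    {"id": 1, "name": "a",  "email": "a@example.com"},
    {"id": 2, "name": None, "email": None},
    {"id": 3, "name": "c",  "email": "c@example.com"},
]

THRESHOLD = 0.9  # monitoring threshold, an assumption for the example

for col in rows[0].keys():
    non_null = sum(1 for r in rows if r.get(col) is not None)
    completeness = non_null / len(rows)
    status = "ok" if completeness >= THRESHOLD else "ALERT"
    print(f"{col}: completeness={completeness:.2f} [{status}]")
```

Automated monitoring, the third functionality area, amounts to running such metric computations on a schedule and alerting when a value crosses its threshold.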
Purpose
Despite growing access to data, questions of “best fit” data and the appropriate use of results in supporting decision making still plague the life cycle assessment (LCA) community. This discussion paper addresses revisions to assessing data quality captured in a new US Environmental Protection Agency guidance document, as well as additional recommendations on data quality creation, management, and use in LCA databases and studies.
Approach
Existing data quality systems and approaches in LCA were reviewed and tested. The evaluations resulted in a revision of a commonly used pedigree matrix: flow- and process-level data quality indicators are described, scoring criteria are clarified, and further guidance on interpretation is given.
Discussion
Increased training for practitioners on the application of data quality assessment and its limits is recommended, as is a multi-faceted approach that uses the pedigree method alongside uncertainty analysis when interpreting results. A method of data quality score aggregation is proposed, and recommendations are made for using data quality scores in existing data to improve their use in interpreting LCA results. Roles for data generators, data repositories, and data users in LCA data quality management are described. Guidance is provided on using data with data quality scores from other systems alongside data with scores from the new system. The new pedigree matrix and the recommended data quality aggregation procedure can now be applied in the openLCA software.
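One simple way to picture score aggregation: the sketch below is an illustrative weighted average, not the specific procedure proposed in the guidance, rolling flow-level pedigree scores up to the process level with assumed contribution weights.

```python
# Illustrative aggregation of pedigree-style data quality scores (1 = best,
# 5 = worst, as in common LCA pedigree matrices). Weighting each flow by
# its contribution is an assumption of this sketch, not the EPA procedure.
flows = [
    # (flow name, pedigree score for one indicator, contribution weight)
    ("electricity", 2, 0.6),
    ("steel",       4, 0.3),
    ("transport",   3, 0.1),
]

total_weight = sum(w for _, _, w in flows)
aggregated = sum(score * w for _, score, w in flows) / total_weight
print(f"process-level score: {aggregated:.2f}")  # -> 2.70
```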
Future work
Additional ways in which data quality assessment might be improved and expanded are described. Interoperability efforts in LCA data should focus on descriptors that enable users to score data quality themselves, rather than on translating existing scores. Also needed are data quality indicators for additional dimensions of LCA data, and automation of data quality scoring through metadata extraction and comparison against the goal and scope.
The diffusion of Open Government Data (OGD) has proceeded at a very fast pace in recent years. However, evidence from practitioners shows that disclosing data without proper quality control may jeopardize dataset reuse and negatively affect civic participation. Current approaches to the problem in the literature lack a comprehensive theoretical framework. Moreover, most evaluations concentrate on open data platforms rather than on datasets.
In this work, we address these two limitations and set up a framework of indicators to measure the quality of Open Government Data on a series of data quality dimensions at the most granular level of measurement. We validated the evaluation framework by applying it to compare two cases of Italian OGD datasets: an internationally recognized good example of OGD, with centralized disclosure and extensive data quality controls, and samples of OGD from decentralized data disclosure (at the municipality level), with no possibility of extensive quality controls as in the former case, and hence with supposedly lower quality.
Starting from measurements based on the quality framework, we were able to verify the difference in quality: the measures showed a few good practices and weaknesses common to both, and a set of discriminating factors that pertain to the type of datasets and the overall approach. On the basis of this evaluation, we also provide technical and policy guidelines to overcome the weaknesses observed in the decentralized release policy, addressing specific quality aspects.
• We provide a metric-based evaluation framework to assess the quality of Open Government Data (OGD).
• We verify the suitability of the framework by applying it to OGD datasets.
• We observed that OGD from centralized data release resulted in better quality than decentralized release.
• The most common problems identified by the metrics were lack of metadata, incomplete data, and lack of information on data updates.
• We provide references to tools and processes to improve weaknesses in specific quality characteristics.
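To illustrate what dataset-level (rather than platform-level) indicators can look like, the sketch below scores a single OGD dataset on three of the problem areas named in the highlights: metadata presence, record completeness, and update information. All field names, records, and thresholds are invented, not taken from the framework.

```python
from datetime import date

# Invented OGD dataset descriptor; field names are assumptions of the sketch.
dataset = {
    "title": "Municipal budgets 2023",
    "license": "CC-BY",
    "last_updated": date(2024, 3, 1),
    "records": [
        {"year": 2023, "amount": 100.0},
        {"year": 2023, "amount": None},
    ],
}

# Indicator 1: metadata presence (share of expected metadata fields filled).
expected_metadata = ["title", "license", "last_updated"]
metadata_score = (sum(dataset.get(f) is not None for f in expected_metadata)
                  / len(expected_metadata))

# Indicator 2: completeness (share of non-null cells).
cells = [v for r in dataset["records"] for v in r.values()]
completeness = sum(v is not None for v in cells) / len(cells)

# Indicator 3: currency (was the dataset updated within the last year?).
current = (date.today() - dataset["last_updated"]).days <= 365

print(f"metadata: {metadata_score:.2f}, "
      f"completeness: {completeness:.2f}, current: {current}")
```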
• Good data quality is essential for registry-based pharmacovigilance studies.
• Incorrect application of variable definitions might hamper data quality.
• Written informed consent should be provided but is often neglected in registries.
• Interventions include SOPs, training of data managers, and centralized data collection.
• Regular audits followed by feedback to data managers can improve data quality.
Good data quality is essential when rare disease registries are used as a data source for pharmacovigilance studies. This study investigated data quality of the Swiss cystic fibrosis (CF) registry in the frame of a European Cystic Fibrosis Society Patient Registry (ECFSPR) project aiming to implement measures to increase data reliability for registry-based research.
All 20 pediatric and adult Swiss CF centers participated in a data quality audit between 2018 and 2020, and in a re-audit in 2022. Accuracy, consistency and completeness of variables and definitions were evaluated, and missing source data and informed consents (ICs) were assessed.
The first audit included 601 of 997 Swiss people with CF (60.3 %). Data quality, defined as data correctness ≥95 %, was high for most variables. Inconsistencies in specific variables were observed because of incorrect application of the variable definitions. The proportion of missing data was low, at <5 % for almost all variables. A considerable amount of source data was missing for CFTR variants. Availability of ICs varied widely between centers (10 centers had >5 % of documents missing). After feedback was provided to the centers, the availability of genetic source data and ICs improved.
Data audits demonstrated overall good data quality in the Swiss CF registry. Specific measures such as supporting the participating sites, training data managers, and centralizing data collection should be implemented in rare disease registries to optimize data quality and provide robust data for registry-based scientific research.
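A minimal sketch of the per-variable checks such an audit implies is given below; the registry and source records are invented, and only the 95 % correctness and 5 % missingness thresholds come from the abstract.

```python
# Sketch: per-variable correctness (agreement with source data) and
# missingness in a registry extract. All records are invented.
registry = [
    {"id": 1, "fev1": 80,   "genotype": "F508del/F508del"},
    {"id": 2, "fev1": None, "genotype": "F508del/G551D"},
    {"id": 3, "fev1": 65,   "genotype": "F508del/F508del"},
]
source = {
    1: {"fev1": 80, "genotype": "F508del/F508del"},
    2: {"fev1": 72, "genotype": "F508del/G551D"},
    3: {"fev1": 66, "genotype": "F508del/F508del"},
}

for var in ["fev1", "genotype"]:
    present = [r for r in registry if r[var] is not None]
    missing_rate = 1 - len(present) / len(registry)
    correct = sum(r[var] == source[r["id"]][var] for r in present)
    correctness = correct / len(present) if present else 0.0
    # Thresholds from the abstract: correctness >= 95 %, missing < 5 %.
    print(f"{var}: correctness={correctness:.0%} (target >=95%), "
          f"missing={missing_rate:.0%} (target <5%)")
```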
No standards exist for the handling and reporting of data quality in health research. This work introduces a data quality framework for observational health research data collections, with supporting software implementations to facilitate harmonized data quality assessments.
Developments were guided by the evaluation of an existing data quality framework and literature reviews. Functions for the computation of data quality indicators were written in R. The concept and implementations are illustrated based on data from the population-based Study of Health in Pomerania (SHIP).
The data quality framework comprises 34 data quality indicators. These target four aspects of data quality: compliance with pre-specified structural and technical requirements (integrity); presence of data values (completeness); inadmissible or uncertain data values and contradictions (consistency); and unexpected distributions and associations (accuracy). R functions calculate data quality metrics based on the provided study data and metadata, and R Markdown reports are generated. Guidance on the concept and tools is available through a dedicated website.
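The framework itself ships as R functions; purely to illustrate the metadata-driven idea in a compact form, the sketch below (invented study data and metadata, and Python rather than R) computes one simple indicator for three of the four aspects: wrong data types (integrity), missing values (completeness), and inadmissible values (consistency).

```python
# Illustration only: the actual framework provides R functions. Here,
# per-variable metadata drives simple indicator computations. Data invented.
metadata = {
    "age": {"type": int,   "min": 0,    "max": 120},
    "sbp": {"type": float, "min": 60.0, "max": 260.0},  # systolic blood pressure
}
study_data = [
    {"age": 52,   "sbp": 135.0},
    {"age": None, "sbp": 300.0},  # missing age, inadmissible sbp
    {"age": "x",  "sbp": 120.0},  # wrong type for age
]

for var, meta in metadata.items():
    values = [r[var] for r in study_data]
    n = len(values)
    missing = sum(v is None for v in values)  # completeness
    wrong_type = sum(v is not None and not isinstance(v, meta["type"])
                     for v in values)         # integrity
    inadmissible = sum(isinstance(v, meta["type"])
                       and not (meta["min"] <= v <= meta["max"])
                       for v in values)       # consistency
    print(f"{var}: missing={missing}/{n}, wrong_type={wrong_type}/{n}, "
          f"inadmissible={inadmissible}/{n}")
```

In the actual framework, the same separation holds at larger scale: study data and machine-readable metadata go in, and indicator values plus generated reports come out.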
The presented data quality framework is the first of its kind for observational health research data collections that links a formal concept to implementations in R. The framework and tools facilitate harmonized data quality assessments in pursuit of transparent and reproducible research. Application scenarios include data quality monitoring while a study is being carried out, as well as an initial data analysis before substantive scientific analyses begin, but the developments are also of relevance beyond research.