E-resources
Full text
Peer-reviewed
  • Leveraging Currency for Rep...
    Ding, Xiaoou; Wang, Hongzhi; Su, Jiaxuan; Wang, Muxian; Li, Jianzhong; Gao, Hong

    IEEE Transactions on Knowledge and Data Engineering, 2022-03-01, Volume: 34, Issue: 3
    Journal Article

    Data quality plays a key role in big data management today. With the explosive growth of data from a variety of sources, data quality faces multiple problems. Motivated by this, we study data cleaning for incompleteness and inconsistency together, with currency reasoning and determination. We introduce a 4-step framework, named Imp3C, for error detection and quality improvement in incomplete and inconsistent data without timestamps. We develop an integrated currency determining method to compute the currency orders among tuples according to currency constraints. Thus, inconsistent data and missing values are repaired effectively with the temporal impact taken into account. For both effectiveness and efficiency, we carry out inconsistency repair ahead of incompleteness repair. A currency-related consistency distance metric is defined to measure the similarity between dirty tuples and clean ones more accurately. In addition, currency orders are treated as an important feature in the missing-value imputation training process. The solution algorithms are introduced in detail with case studies. A thorough experiment on three real-life datasets verifies that our method Imp3C improves the performance of data repairing with multiple quality problems. Imp3C outperforms the existing advanced methods, especially on datasets with complex currency orders.
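    As a rough illustration of what computing currency orders among tuples without timestamps can look like, the sketch below orders records of one entity using value-precedence rules. It is not the Imp3C algorithm from the paper: the constraint format, the attribute names (`status`, `title`), the sample records, and the function `currency_order` are all hypothetical, chosen only to make the idea concrete.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical currency constraints (an assumption, not taken from the paper):
# for each attribute, a pair (older, newer) means a tuple holding `older`
# is less current than a tuple holding `newer`.
CONSTRAINTS = {
    "status": [("single", "married")],
    "title": [("engineer", "senior engineer"), ("senior engineer", "manager")],
}


def currency_order(tuples):
    """Return tuple indices ordered from least to most current.

    Tuple i becomes a predecessor of tuple j whenever some constraint says
    i's value precedes j's value. Conflicting constraints raise
    graphlib.CycleError.
    """
    preds = {i: set() for i in range(len(tuples))}
    for attr, pairs in CONSTRAINTS.items():
        for older, newer in pairs:
            for i, t in enumerate(tuples):
                for j, u in enumerate(tuples):
                    if t.get(attr) == older and u.get(attr) == newer:
                        preds[j].add(i)  # i must come before j
    return list(TopologicalSorter(preds).static_order())


# Three undated records assumed to describe the same person.
records = [
    {"status": "married", "title": "manager"},
    {"status": "single", "title": "engineer"},
    {"status": "married", "title": "senior engineer"},
]

order = currency_order(records)
print(order)  # [1, 2, 0]
```

    In this toy example the last index in `order` points at the most current record, which a repair step could then prefer when resolving conflicting or missing values. How the paper actually combines currency orders with repair and imputation is described in the article itself.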