Big Data Cleaning

Posted by: Xianmin Liu | Published: 2021-08-23


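Project Overview

Data quality problems degrade big data applications, so they need to be cleaned and repaired. To address the various quality problems found in multi-modal data, the project team has proposed a series of data cleaning algorithms and developed two systems: CleanCloud, a big data cleaning system built on parallel computing platforms such as MapReduce and Hyracks, and Cleanits, a cleaning system for industrial time series data.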



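Parallel Data Cleaning System CleanCloud

Main functions:

1. Entity resolution: identifies records from different data sources that refer to the same real-world entity.

2. Truth discovery: finds the true value of an attribute among conflicting data that contains quality errors.

3. Inconsistency detection and repair: real datasets often contain data that violates the integrity constraints originally defined for them, causing inconsistencies within a dataset or across datasets. The system uses integrity constraints to detect and repair these inconsistencies.

4. Missing value imputation: fills in and repairs missing data in a dataset.

5. Visualization of cleaning results: the results of data quality detection and cleaning are presented to the user as charts and tables, giving an intuitive view of the dataset's quality assessment.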


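System features:

Scalability: a three-layer FLI architecture implements a merging and optimization strategy for multiple subtasks deployed on MapReduce, enabling a parallel data cleaning process.

Effectiveness: the system effectively detects and repairs several kinds of data quality problems, including inconsistent, missing, and erroneous data.

Efficiency: optimizing the workflow reduces the number of MapReduce rounds and therefore the system's running time. The efficiency of the cleaning process has been verified both theoretically and experimentally.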


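Technical features

For entity resolution, after reading the preprocessed data, the system computes the similarity between entities in the same attribute index table and compares it with a threshold. Pairs whose similarity exceeds the threshold are written to a similar-pair set file. An entity partitioning model then builds a graph from that file, and partitioning the graph yields the final entity clusters.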
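The pipeline described above (similarity, threshold, graph, partition) can be sketched on a single machine as follows. This is only an illustration under stated assumptions: the page does not name the similarity function or the partitioning model, so word-level Jaccard similarity and connected components (via union-find) are stand-ins, and the names `jaccard` and `resolve_entities` are hypothetical.

```python
from itertools import combinations


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the word sets of two strings (assumed metric)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def resolve_entities(records: dict[str, str], threshold: float) -> list[set[str]]:
    """Group record ids that are linked by above-threshold similarity."""
    # Step 1: keep only the pairs whose similarity exceeds the threshold.
    similar_pairs = [
        (i, j)
        for i, j in combinations(sorted(records), 2)
        if jaccard(records[i], records[j]) > threshold
    ]

    # Step 2: treat the pairs as graph edges and merge components (union-find).
    # Connected components stand in for the system's partitioning model.
    parent = {rid: rid for rid in records}

    def find(x: str) -> str:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in similar_pairs:
        parent[find(i)] = find(j)

    groups: dict[str, set[str]] = {}
    for rid in records:
        groups.setdefault(find(rid), set()).add(rid)
    return list(groups.values())


records = {
    "r1": "Harbin Institute of Technology",
    "r2": "Harbin Institute Technology",
    "r3": "Peking University",
}
clusters = resolve_entities(records, threshold=0.5)
```

Here `r1` and `r2` share three of four distinct words, so they fall into one cluster, while `r3` stays on its own. Comparing all pairs is quadratic; a parallel system would restrict comparisons to records that share an index entry.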

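For detecting and repairing inconsistent data, the user first supplies a set of CFD (conditional functional dependency) rules. Based on these rules, the system identifies constant violations and variable violations in the dataset. The system optimizes inconsistency detection by moving constant violation detection and repair directly into the first MapReduce round of variable violation detection and repair, which reduces both the number of MapReduce rounds and the number of I/O operations.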
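To make the two kinds of violation concrete, here is a minimal single-machine sketch that checks both in one pass over the data, loosely mirroring the idea of folding constant-violation detection into the same round as variable-violation detection. The rule encoding (a pattern dictionary with `"_"` as a wildcard), the function names, and the sample rows are all assumptions made for illustration, not the system's actual interface.

```python
from collections import defaultdict

WILDCARD = "_"  # assumed marker for "any value" in a CFD pattern


def matches(row: dict, attrs: list[str], pattern: dict) -> bool:
    """True if the row agrees with every constant in the pattern on attrs."""
    return all(pattern[a] == WILDCARD or row[a] == pattern[a] for a in attrs)


def detect_violations(rows, lhs, rhs, pattern):
    """Return (constant_violations, variable_violations) for one CFD.

    constant_violations: indices of single rows contradicting a constant RHS.
    variable_violations: groups of row indices that agree on the LHS
    attributes but disagree on the RHS attribute.
    """
    constant, groups = [], defaultdict(list)
    for idx, row in enumerate(rows):
        if not matches(row, lhs, pattern):
            continue  # the rule does not apply to this row
        if pattern[rhs] != WILDCARD:
            if row[rhs] != pattern[rhs]:
                constant.append(idx)
        else:
            groups[tuple(row[a] for a in lhs)].append(idx)
    variable = [
        sorted(members)
        for members in groups.values()
        if len({rows[i][rhs] for i in members}) > 1
    ]
    return constant, variable


rows = [
    {"cc": "44", "ac": "131", "city": "EDI", "zip": "EH4", "street": "Mayfield"},
    {"cc": "44", "ac": "131", "city": "LDN", "zip": "EH4", "street": "Crichton"},
    {"cc": "01", "ac": "908", "city": "MH", "zip": "07974", "street": "Mtn Ave"},
]

# Constant rule: country code 44 and area code 131 imply city EDI.
constant_hits, _ = detect_violations(
    rows, ["cc", "ac"], "city", {"cc": "44", "ac": "131", "city": "EDI"}
)
# Variable rule: within country code 44, the zip code determines the street.
_, variable_hits = detect_violations(
    rows, ["cc", "zip"], "street", {"cc": "44", "zip": "_", "street": "_"}
)
```

Row 1 breaks the constant rule on its own, while rows 0 and 1 together break the variable rule because they share a zip code but list different streets.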

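For missing value imputation, the system uses a mechanism based on naive Bayes classification. After identifying the original tuples that contain missing values, it computes, for each such tuple, the probability of every candidate fill value given the values of the attributes it depends on, and fills in the candidate with the highest conditional probability. The system also optimizes the imputation process: it changes the information and format of the parameter estimation module's output so that the results can be used directly by the imputation module, reducing intermediate computation.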
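The imputation rule above, choosing the candidate with the largest conditional probability, can be sketched as below. This is a small in-memory illustration, not the MapReduce implementation: the function name `impute_naive_bayes`, the Laplace smoothing parameter `alpha`, and the sample data are assumptions. It also assumes at least one complete row exists and that the dependent attributes of the incomplete rows are present.

```python
from collections import Counter, defaultdict


def impute_naive_bayes(rows, target, features, alpha=1.0):
    """Fill None values of `target` with the most probable value given `features`."""
    # Parameter estimation: count classes and per-class feature values
    # over the rows where the target is known.
    complete = [r for r in rows if r[target] is not None]
    class_counts = Counter(r[target] for r in complete)
    cond_counts = defaultdict(Counter)  # (feature, class) -> value counts
    domain = defaultdict(set)           # feature -> observed values
    for r in complete:
        for f in features:
            cond_counts[(f, r[target])][r[f]] += 1
            domain[f].add(r[f])

    def score(row, cls):
        # P(cls) * product over features of P(row[f] | cls), Laplace-smoothed.
        p = class_counts[cls] / len(complete)
        for f in features:
            num = cond_counts[(f, cls)][row[f]] + alpha
            den = class_counts[cls] + alpha * len(domain[f])
            p *= num / den
        return p

    # Imputation: pick the candidate with the highest score for each gap.
    filled = []
    for r in rows:
        if r[target] is None:
            best = max(class_counts, key=lambda c: score(r, c))
            r = {**r, target: best}
        filled.append(r)
    return filled


rows = [
    {"city": "Harbin", "province": "Heilongjiang"},
    {"city": "Harbin", "province": "Heilongjiang"},
    {"city": "Dalian", "province": "Liaoning"},
    {"city": "Harbin", "province": None},
]
filled = impute_naive_bayes(rows, target="province", features=["city"])
```

In this example the last row is filled with `"Heilongjiang"`, since that value has both the higher prior and the higher likelihood for `city == "Harbin"`. Keeping the counts in a form the scoring step can read directly is the single-machine analogue of having the parameter estimation output feed the imputation module without an intermediate conversion.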



Screenshots

Data source connection page


Data processing progress page


Cleaned dataset display page



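Cleanits: A Cleaning System for Industrial Time Series Data

Overview

Sound and reliable industrial big data analysis places high demands on data quality. However, raw machine-collected data often contains many errors, and IoT data suffers from problems such as invalid operating conditions, incorrect timestamps, misaligned timestamps, and abnormal operating conditions. Such low-quality data limits in-depth analysis of industrial data.

Although industrial data quality problems are increasingly prominent, research on cleaning and repairing industrial time series data is still in its early stages. We therefore designed and developed Cleanits, an industrial time series data cleaning system that detects and repairs quality problems in multi-dimensional industrial time series. The system effectively detects and repairs three serious classes of industrial data quality problems: missing sequence segments, misplaced sequence intervals, and abnormal sequence intervals. It also provides modules that support intelligent modeling and analysis of domain expert knowledge and labeled sample data, which improves the accuracy of the cleaning algorithms.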


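Main functions:

  1. Filling and repairing incomplete data in time series collected by industrial sensors:

     1) a repair method based on autoregression;

     2) a repair method based on moving averages.

  2. Detecting and repairing anomalous values in the data:

     1) detection and repair based on order constraints;

     2) detection and repair based on variance constraints.

  3. Detecting and repairing misplaced columns in multi-dimensional time series.

  4. Parameter setting and input.

  5. Statistics and display of cleaning results: cleaning results are presented to the user as charts and tables.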
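As a rough illustration of repairing incomplete data, the sketch below fills each missing point with the mean of the most recent known or already repaired values. It is a simple trailing moving average, not a fitted moving average model, and it is not taken from Cleanits; the function name `fill_moving_average` and the `window` parameter are hypothetical.

```python
def fill_moving_average(series, window=3):
    """Replace None entries with the mean of up to `window` preceding values."""
    repaired = []
    for value in series:
        if value is None:
            # Use only preceding values that are actually available.
            history = [v for v in repaired[-window:] if v is not None]
            # With no usable history the gap cannot be estimated; keep None.
            value = sum(history) / len(history) if history else None
        repaired.append(value)
    return repaired


readings = [10.0, 12.0, None, 14.0, None]
repaired = fill_moving_average(readings, window=2)
```

With `window=2`, the first gap becomes the mean of 10.0 and 12.0, and the second gap becomes the mean of the repaired value 11.0 and 14.0. An autoregressive repair would follow the same loop but replace the mean with a weighted combination whose coefficients are estimated from the data.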


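System features:

  1. Practical: closely tied to industrial scenarios, the system meets the real needs of detecting and cleaning time series data quality problems in manufacturing big data.

  2. High cleaning quality: the system effectively detects imprecise data points in time series, with low false-positive and false-negative rates and high detection accuracy; the performance of the detection methods meets practical needs.

  3. High cleaning efficiency: algorithmic optimizations reduce the number of checks performed by the identification algorithm, which shortens its running time.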



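Technical features

The system detects, cleans, and repairs three main classes of data quality problems in sensor-collected time series. The incomplete data cleaning module implements two repair algorithms based on statistical models: an autoregressive model and a moving average model. The imprecise data cleaning module implements two constraint-based detection methods, order constraints and variance constraints; for repair, it implements two kinds of statistical-model-based methods and a method based on linear constraints. The inconsistent data cleaning module identifies and repairs subsequences stored in the wrong position. Together, these modules can effectively clean sequences with different characteristics. In addition, users can choose among several types of cleaning methods to suit their individual needs regarding cleaning efficiency, interaction frequency, and cleaning granularity.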
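One plausible reading of a variance constraint is that the variance of the values inside any sliding window must stay below a bound. The sketch below flags the windows that break such a bound. The exact constraint definition used by Cleanits is not given on this page, so the window formulation, the function name `variance_violations`, and the parameters are assumptions for illustration only.

```python
from statistics import pvariance


def variance_violations(series, window, max_variance):
    """Return start indices of windows whose population variance exceeds the bound."""
    return [
        start
        for start in range(len(series) - window + 1)
        if pvariance(series[start:start + window]) > max_variance
    ]


# A stable signal with one spike at index 3.
signal = [1.0, 1.1, 0.9, 9.0, 1.0, 1.05]
flagged = variance_violations(signal, window=3, max_variance=0.5)
```

Every window that contains the spike is flagged, while the first window, which covers only stable readings, passes. The point shared by all flagged windows is the natural candidate for repair.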




Screenshots

Result page after imprecise data repair

Cleaning result statistics page






Research Outcomes


Papers

  1. 李亚坤,王宏志,高宏,李建中.基于实体描述属性技术的XML重复对象检测方法.计算机学报, 2011, 34(11), 2131-2141.

  2. 王宏志,樊文飞.复杂数据上的实体识别技术研究.计算机学报,2011, 34(10), 1843-1852.

  3. Lingli Li, Jianzhong Li, Hongzhi Wang, Hong Gao: Context-based entity description rule for entity resolution. CIKM 2011: 1725-1730

  4. 刘永楠,王宏志,高宏.MapReduce框架下基于字符串波形的实体识别方法.计算机科学与探索,2011,5(8): 730-739  

  5. Liu Yongnan, Wang Hongzhi, Gao Hong. A Fast Entity Resolution Method based on Wave of Records. CECNet 2011.

  6. Yakun Li, Hongzhi Wang and Hong Gao. Efficient Entity Resolution based on Sequence Rules. 2011 international conference on Computer Science and Information Engineering (CSIE2011).

  7. Hongzhi Wang, Jianzhong Li, Ran Huo, Li Jia, Lian Jin, Xueying Men, Hui Xie. HITCleaner: A Light-weight Online Data Cleaning System. Proceedings of DASFAA 2013, 481-484. Demo.

  8. Fangda Wang, Hongzhi Wang, Jianzhong Li, Hong Gao. Graph-based Reference Table Construction to Facilitate Entity Matching. Journal of Systems and Software, Volume 86, Issue 6, 2013, 1679-1688.

  9. Yakun Li, Hongzhi Wang, Hong Gao, Jianzhong Li. An Efficient Entity Resolution Method for Large Relations. International Journal of Cooperative Information Systems, Vol. 22, No. 1 (2013), 1-17.

  10. Huabin Feng, Hongzhi Wang, Jianzhong Li, Hong Gao. Entity Resolution on Uncertain Relations. WAIM 2013.  

  11. Lian Jin, Hongzhi Wang, Hong Gao. Imputation for Categorical Attributes with probabilistic Reasoning. WAIM 2013.  

  12. Rui Guo, Hongzhi Wang, Jianzhong Li, Hong Gao. CUVIM: Extracting Fresh Information from Social Network. WAIM 2013.  

  13. Xie Hui, Hongzhi Wang, Jianzhong Li, Hong Gao. A data cleaning framework based on user feedback. WAIM 2013.  

  14. Li Jia, Hongzhi Wang, Jianzhong Li, Hong Gao. Incremental Truth Discovery for Information from Multiple Data Sources. MDSP 2013.  

  15. 李明达,王宏志,张佳程,李建中,高宏. PEIF: 基于并行机群的大数据实体识别算法.计算机研究与发展,50(Suppl.): 211-220,2013.

  16. 金连,王宏志,黄沈斌,高宏.基于Map-Reduce的大数据缺失值填充算法.计算机研究与发展,50(Suppl.): 312-321, 2013.

  17. 霍然,王宏志,朱鎔,李建中,高宏.基于Map-Reduce的大数据实体识别算法.第一届全国大数据会议,计算机研究与发展(增刊).2013

  18. Hongzhi Wang, Mingda Li, Yingyi Bu, Jianzhong Li, Hong Gao, Jiacheng Zhang. Cleanix: A Big Data Cleaning Parfait. CIKM 2014.

  19. Yan Zhang, Hongzhi Wang, Zhongsheng Yang, Jianzhong Li. Relative Accuracy Evaluation. PLoS ONE 9(8): e103853. doi:10.1371/journal.pone.0103853. 2014.

  20. Ye Chen, Hongzhi Wang. Capture Missing Values based on Crowdsourcing. WASA 2014 Workshop.  

  21. Yan Zhang, Hongzhi Wang. Accuracy Evaluation for Sensed Data. WASA 2014  

  22. Mingda Li, Hongzhi Wang and Ye Li. Sectional and Conditional Functional Dependencies. WASA 2014 Workshop.

  23. Chen Ye, Hongzhi Wang. Truth discovery based on Crowdsourcing. WAIM 2014.

  24. Chen Ye, Hongzhi Wang, Keli Li, Qian Chen, Jianhua Chen, Jiangduo Song, Weidong Yuan: CrowdCleaner: A Data Cleaning System Based on Crowdsourcing. APWeb 2014: 657-661

  25. Guangze Liu, Hongzhi Wang, ChengHui Chen, Hong Gao: TruthOrRumor: Truth Judgment from Web. APWeb 2014: 674-678

  26. 王宏志.大数据质量管理:问题与研究进展.科技导报, 2014, 32(34): 78-84.

  27. Hongzhi Wang, Mingda Li, Yingyi Bu, Jianzhong Li, Hong Gao, Jiacheng Zhang. Cleanix: a Parallel Big Data Cleaning System. Sigmod Record 44(4), 2015.

  28. 杨东华,李宁宁,王宏志,李建中,高宏.基于任务合并的并行大数据清洗过程优化.计算机学报, 2015, Vol. 38.

  29. Ming Yan, Yan Zhang, Hongzhi Wang: Tree-Based Metric Learning for Distance Computation in Data Mining. APWeb 2015: 377-388

  30. Chang Lu, Hongzhi Wang, Yan Zhang, Hong Gao. Euclidean-based Entity Resolution for Evolving Data. 2015 Fifth International Conference on Instrumentation and Measurement, Computer, Communication and Control.

  31. Zeyu Li, Hongzhi Wang, Wei Shao, Jianzhong Li, Hong Gao: Repairing Data through Regular Expressions. PVLDB 9(5): 432-443 (2016)

  32. Hongzhi Wang, Jianzhong Li, Hong Gao: Efficient entity resolution based on subgraph cohesion. Knowl. Inf. Syst. 46(2): 285-314 (2016)

  33. Yue Wang, Hongzhi Wang, Liyan Zhang, Yang Wang, Jianzhong Li, Hong Gao: Extend tree edit distance for effective object identification. Knowl. Inf. Syst. 46(3): 629-656 (2016)

  34. Yan Zhang, Hongzhi Wang, Hong Gao, Jianzhong Li: Efficient accuracy evaluation for multi-modal sensed data. J. Comb. Optim. 32(4): 1068-1088 (2016)  

  35. Yue Wang, Hongzhi Wang, Jianzhong Li, Hong Gao: Efficient graph similarity join for information integration on graphs. Frontiers Comput. Sci. 10(2): 317-329 (2016)  

  36. Rui Guo, Hongzhi Wang, Mengwen Chen, Jianzhong Li, Hong Gao: Parallelizing the extraction of fresh information from online social networks. Future Generation Comp. Syst. 59: 33-46 (2016)

  37. Meifan Zhang, Hongzhi Wang, Jianzhong Li, Hong Gao. One-pass Inconsistency Detection Algorithms for Big Data. DASFAA 2016.

  38. Chen Ye, Hongzhi Wang, Jianzhong Li, Hong Gao, Siyao Cheng. Crowdsourcing-enhanced Missing Values Imputation based on Bayesian Network. DASFAA 2016.

  39. 丁小欧,王宏志,张笑影,李建中,高宏.数据质量多种性质的关联关系研究.软件学报,2016,27(7).

  40. 李建中,王宏志.大数据可用性的研究进展.软件学报,2016,27(7).

  41. Yitong Gao, Yan Zhang, Hongzhi Wang, Jianzhong Li, Hong Gao: A Distributed Load Balance Algorithm of MapReduce for Data Quality Detection. DASFAA Workshops 2016: 294-306

  42. Xiaoou Ding, Hongzhi Wang, Yitong Gao, Jianzhong Li, Hong Gao. Efficient Currency Determination Algorithms for Dynamic Data. TSINGHUA SCIENCE AND TECHNOLOGY, 22(3), 227-242, 2017.

  43. Hongzhi Wang, Zhixin Qi, Ruoxi Shi, Jian-Zhong Li, Hong Gao: COSSET+: Crowdsourced Missing Value Imputation Optimized by Knowledge Base. J. Comput. Sci. Technol. 32(5): 845-857 (2017)

  44. Hongzhi Wang, Xiaoou Ding, Xiangying Chen, Jianzhong Li, Hong Gao: CleanCloud: Cleaning Big Data on Cloud. CIKM 2017: 2543-2546

  45. 叶晨,王宏志.基于众包的数据清洗模型研究.人工智能学会通讯, 2017年第3期, 31-38.

  46. Zhixin Qi, Hongzhi Wang, Fanshan Meng, Jianzhong Li, Hong Gao: Capture Missing Values with Inference on Knowledge Base. DASFAA Workshops 2017: 185-194

  47. Yiwen Tang, Hongzhi Wang, Shiwei Zhang, Huijun Zhang, Ruoxi Shi: Efficient Web-Based Data Imputation with Graph Model. DASFAA Workshops 2017: 213-226

  48. Xiaoou Ding, Hongzhi Wang, Yitong Gao, Jianzhong Li, Hong Gao: Determining the currency of dynamic data. ACM TUR-C 2017: 17:1-17:6

  49. Yanjie Wei, Hongzhi Wang, Shengfei Shi, Hong Gao, Jianzhong Li: Any-Time Methods for Time-Series Prediction with Missing Observations. BigData Congress 2017: 427-430

  50. Hongzhi Wang, Xiaoou Ding, Jianzhong Li, Hong Gao: Rule-Based Entity Resolution on Database with Hidden Temporal Information. IEEE Trans. Knowl. Data Eng. 30(11): 2199-2212 (2018)

  51. Zhixin Qi, Hongzhi Wang, Jianzhong Li, Hong Gao: FROG: Inference from knowledge base for missing value imputation. Knowl.-Based Syst. 145: 77-90 (2018)

  52. Jizhou Sun, Jianzhong Li, Hong Gao, Hongzhi Wang. Truth discovery on inconsistent relational data. Tsinghua Science and Technology, v 23, n 3, p 288-302, 2018

  53. Hiba Abu Ahmad, Hongzhi Wang: An effective weighted rule-based method for entity resolution. Distributed and Parallel Databases 36(3): 593-612 (2018)

  54. Wei Yin, Tianbai Yue, Hongzhi Wang, Yanhao Huang, Yaping Li: Time Series Cleaning Under Variance Constraints. DASFAA Workshops 2018: 108-113

  55. Xiaoou Ding, Hongzhi Wang, Jiaxuan Su, Zijue Li, Jianzhong Li, Hong Gao: Cleanits: A Data Cleaning System for Industrial Time Series. PVLDB 12(12): 1786-1789 (2019)

  56. Hongzhi Wang, Xiaoou Ding, Jianzhong Li, Hong Gao: Rule-Based Entity Resolution on Database with Hidden Temporal Information (Extended Abstract). ICDE 2019: 2143-2144

  57. Meifan Zhang, Hongzhi Wang, Jianzhong Li, Hong Gao: One-Pass Inconsistency Detection Algorithms for Big Data. IEEE Access 7: 22377-22394 (2019)

  58. Chen Ye, Hongzhi Wang, Tingting Ma, Jing Gao, Hengtong Zhang, Jianzhong Li: PatternFinder: Pattern discovery for truth discovery. Knowl.-Based Syst. 176: 97-109 (2019)

  59. Chen Ye, Qi Li, Hengtong Zhang, Hongzhi Wang, Jing Gao, Jianzhong Li: AutoRepair: an automatic repairing approach over multi-source data. Knowl. Inf. Syst. 61(1): 227-257 (2019)

  60. Mohamed Jaward Bah, Hongzhi Wang, Mohamed Hammad, Furkh Zeshan, Hanan Aljuaid: An Effective Minimal Probing Approach With Micro-Cluster for Distance-Based Outlier Detection in Data Streams. IEEE Access 7: 154922-154934 (2019)

  61. 丁小欧,王宏志,于晟健.工业时序大数据质量管理[J].大数据,2019,5(06):1-11.

  62. Yiming Lin, Hongzhi Wang, Jianzhong Li, Hong Gao: Efficient Entity Resolution on Heterogeneous Records. IEEE Trans. Knowl. Data Eng. 32(5): 912-926 (2020)

  63. Chen Ye, Hongzhi Wang, Kangjie Zheng, Jing Gao, Jianzhong Li: Multi-source data repairing powered by integrity constraints and source reliability. Inf. Sci. 507: 386-403 (2020)

  64. Chen Ye, Hongzhi Wang, Wenbo Lu, Jianzhong Li: Effective Bayesian-network-based missing value imputation enhanced by crowdsourcing. Knowl. Based Syst. 190: 105199 (2020)

  65. Yinan An, Sifan Liu, Hongzhi Wang: Error Detection in a Large-Scale Lexical Taxonomy. Information 11(2): 97 (2020)

  66. Mohamed Jaward Bah, Hongzhi Wang: A Parametric and Non-Parametric Approach for High-Accurate Outlier Detection. J. Inf. Sci. Eng. 36(2): 441-465 (2020)

  67. 叶晨,王宏志,高宏,李建中.面向众包数据清洗的主动学习技术.软件学报,2020,31(4): 1162-1172.

  68. 丁小欧,于晟健,王沐贤,王宏志,高宏,杨东华.基于相关性分析的工业时序数据异常检测.软件学报,2020,31(3): 726-747.

  69. Mingda Li, Hongzhi Wang, Jianzhong Li: Mining conditional functional dependency rules on big data. Big Data Min. Anal. 3(1): 68-84 (2020)

  70. Zijue Li, Xiaoou Ding, Hongzhi Wang: An Effective Constraint-Based Anomaly Detection Approach on Multivariate Time Series. APWeb/WAIM (2) 2020: 61-69

  71. Yiming Lin, Hongzhi Wang, Jianzhong Li, Hong Gao: Efficient Entity Resolution on Heterogeneous Records (Extended abstract). ICDE 2020: 2074-2075

  72. Hiba Abu Ahmad, Hongzhi Wang: Automatic weighted matching rectifying rule discovery for data repairing. VLDB J. 29(6): 1433-1447 (2020)

  73. Alladoumbaye Ngueilbaye, Hongzhi Wang, Mehak Khan, Daouda Ahmat Mahamat: Adoption of human metabolic processes as Data Quality Based Models. J. Supercomput. 77(2): 1779-1817 (2021)

  74. Zhixin Qi, Hongzhi Wang: Dirty-Data Impacts on Regression Models: An Experimental Evaluation. DASFAA (1) 2021: 88-95

  75. Alladoumbaye Ngueilbaye, Hongzhi Wang, Daouda Ahmat Mahamat, Sahalu B. Junaidu: Modulo 9 model-based learning for missing data imputation. Appl. Soft Comput. 103: 107167 (2021)

  76. Xiaoou Ding, Hongzhi Wang, Jiaxuan Su, Muxian Wang, Jianzhong Li, Hong Gao: Leveraging Currency for Repairing Inconsistent and Incomplete Data (Extended Abstract). ICDE 2021: 2315-2316

  77. Chen Ye, Hongzhi Wang, Kangjie Zheng, YouKang Kong, Rong Zhu, Jing Gao, Jianzhong Li: Constrained Truth Discovery (Extended Abstract). ICDE 2021: 2356-2357

  78. Alladoumbaye Ngueilbaye, Hongzhi Wang, Daouda Ahmat Mahamat, Ibrahim A. Elgendy, Sahalu B. Junaidu: Methods for detecting and correcting contextual data quality problems. Intell. Data Anal. 25(4): 763-787 (2021)


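Patents

  1. Schema integration method and apparatus for massive heterogeneous data, invention patent (granted), 201711116061.4

  2. An optimization method for big-data-oriented parallel systems, invention patent (granted), 201710045825.9

  3. Data cleaning method using CFDs, computer device, and readable storage medium, invention patent (published), 202010124832X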


Software Copyright

  1. Cleancloud: Parallel Big Data Cleaning System v1.0



Project contact: Hongzhi Wang (wangzh@hit.edu.cn)