A few weeks ago I attended one of Stanford's MLSys talks, where Theodoros Rekatsinas spoke
about data quality management. The talk is fantastic for anyone interested in data cleaning and data quality.
One of the tools he released (open source!) is HoloClean. From their website:
"HoloClean is a statistical inference engine to impute, clean, and enrich data. As a weakly supervised machine learning system, HoloClean leverages available quality rules, value correlations, reference data, and multiple other signals to build a probabilistic model that accurately captures the data generation process, and uses the model in a variety of data curation tasks."