Intrinsic Self-Supervision for Data Quality Audits
Highlights
- Introduces a dataset cleaning framework that I was able to install and use on sample datasets in under 2 minutes. I would need to test further to fully evaluate its effectiveness for my domain.
Summary
In machine learning, it's well known that even minor contamination of a dataset can cause significant model degradation, yet validating and cleaning large datasets remains challenging, often because manual verification is infeasible. The authors introduce SelfClean, a combination of context-aware self-supervised representation learning and distance-based indicators, to find off-topic samples, near-duplicates, and label errors in datasets. They show that their method can identify up to 16% of issues in widely used general vision and medical image datasets, surpassing the state of the art.
Key Contributions
- A self-supervised encoder trained from scratch directly on the noisy dataset. Utilizing this encoder helps identify three types of dataset issues: off-topic samples, near-duplicates, and label errors. Identification of all three is distance-based in the learned representation space, forgoing the need for clean reference data.
- The system supports both human-in-the-loop and fully automatic flagging.
- The system formulates dataset cleaning as a set of ranking problems to reduce the effort needed for manual inspection, or as a scoring problem for fully automatic flagging.
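To make the distance-based indicators concrete, here is my own minimal numpy sketch of how such rankings could be computed from an embedding matrix. This is illustrative only, not the SelfClean implementation: the function names, the choice of `k`, and the exact scoring rules (mean k-nearest-neighbor distance for off-topic samples, smallest nearest-neighbor distance for near-duplicates, neighbor label disagreement for label errors) are my assumptions.

```python
import numpy as np

def pairwise_dist(emb):
    # Euclidean distance matrix between all embeddings, shape (n, n).
    diff = emb[:, None, :] - emb[None, :, :]
    return np.sqrt((diff ** 2).sum(-1))

def rank_issues(emb, labels, k=2):
    """Return one suspicion score per sample for each issue type;
    higher score = more suspicious, so sorting descending gives a ranking."""
    dist = pairwise_dist(emb)
    np.fill_diagonal(dist, np.inf)                  # ignore self-distance
    knn = np.argsort(dist, axis=1)[:, :k]           # indices of k nearest neighbors
    knn_dist = np.take_along_axis(dist, knn, axis=1)
    off_topic = knn_dist.mean(axis=1)               # isolated samples score high
    near_dup = -knn_dist[:, 0]                      # near-zero nearest distance scores high
    neighbor_labels = labels[knn]
    label_error = (neighbor_labels != labels[:, None]).mean(axis=1)  # disagreement rate
    return off_topic, near_dup, label_error
```

In the human-in-the-loop setting, one would sort each score descending and inspect only the head of each ranking; in the automatic setting, a threshold on the scores flags samples directly.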
Strengths
- The paper is very approachable.
- The authors conveniently published it as an easy-to-install, drop-in Python package with sufficient documentation to add to any project. I was able to set it up and run it in 2 minutes. The package even contains tests and extra examples.
- The method looks well-generalizable across domains by retraining the encoder, as long as the learned representation is of sufficient quality.
Weaknesses / Questions
- It's unclear how well this would work with extremely large datasets. The complexity the authors state for near-duplicate detection suggests it would need further optimization to work efficiently at that scale.
- The entire algorithm is very close to data cleaning via typical clustering. The paper does eventually distinguish the technique from "simple clustering in disguise," but it took a while to convince me.
- The natural contamination evaluations are far more useful and meaningful than the synthetic ones, which the authors themselves acknowledge. I wonder whether the synthetic contamination results should have been moved to their textbook of an appendix.
- The method may not capture subtle issues, and there is a very real possibility that the learned representation of an arbitrary dataset is not of sufficient quality. The authors acknowledge this candidly in later sections.
- The appendix is incredibly long; it's almost like reading a textbook. Not necessarily a weakness, more an observation. Some weaknesses I initially wrote and later deleted (such as the potential for flagging minority groups as outliers) were actually covered in the appendix. It seems they really tried to cover all their bases.
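On the scalability concern: even before switching to an approximate nearest-neighbor index, the O(n²) memory of an all-pairs distance matrix can be avoided by computing nearest-neighbor distances chunk by chunk. This is my own generic numpy sketch, not anything from the paper; note the time cost stays quadratic, so a proper index (e.g., FAISS or Annoy) would still be needed for truly large n.

```python
import numpy as np

def nearest_dists_chunked(emb, chunk=1024):
    """Nearest-neighbor distance per sample without materializing the
    full n x n matrix; peak memory is O(n * chunk) instead of O(n^2)."""
    n = len(emb)
    out = np.empty(n)
    sq = (emb ** 2).sum(axis=1)                     # squared norms, shape (n,)
    for start in range(0, n, chunk):
        rows = emb[start:start + chunk]
        # squared Euclidean distances from this chunk to all samples,
        # via the expansion |a - b|^2 = |a|^2 + |b|^2 - 2 a.b
        d2 = sq[start:start + chunk, None] + sq[None, :] - 2 * rows @ emb.T
        np.clip(d2, 0, None, out=d2)                # guard against fp negatives
        idx = np.arange(start, min(start + chunk, n))
        d2[np.arange(len(idx)), idx] = np.inf       # ignore self-distance
        out[start:start + chunk] = np.sqrt(d2.min(axis=1))
    return out
```

Samples whose nearest-neighbor distance is near zero are near-duplicate candidates; the chunk size trades memory for the number of matrix-multiply passes.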
Related Work
- Data Quality
- Dataset Cleaning