Poor-quality data: a major obstacle in Machine Learning
As the commercial applications of artificial intelligence and machine learning multiply and rapidly transform various sectors, one truth remains: data quality is the pillar on which these technological advances rest.
Machine Learning (ML) has become essential in many industries and has been used for some years now to build a variety of AI products. The dominant approach is data-centric: for ML models to deliver real value to a business, the quality of the data they are trained on is fundamental. In this article, we explore why data quality is essential, and why painstaking data preparation is the bedrock of the vast majority of AI products.
Why is data quality the cornerstone of your AI projects?
ML algorithms use data to learn and make predictions. However, not all data is equally valuable. Data quality is a major determinant of the accuracy and reliability of ML models.
Professionals working on ML projects (Data Scientists, Developers, Data Labelers, etc.) are well aware of the challenges. Many projects seem to stagnate during the test phases, before deployment, mainly due to the lack of quality in data annotation at scale. Human error, unclear assumptions, the subjective and ambiguous nature of the annotation task and, above all, a lack of supervision and consideration for the work carried out by Data Labelers often contribute to these problems.
Data annotated en masse but in an approximate way... a disaster!
Data inaccuracy can be the result of human error, faulty data collection techniques or problems with the data source. When an ML model is trained on incorrect data, it can make poor decisions.
Some examples to illustrate the impact of models trained with imperfect data on products and use cases:
1. Wrong medical diagnosis
Imagine an AI system designed to help doctors diagnose diseases. If this system is trained on incorrect or incomplete medical data, it could lead to incorrect diagnoses, putting patients' lives at risk. Such a situation underlines the imperative of accurate and complete medical data to guarantee the reliability of AI systems in medicine. To avoid this, and enable the development of high-performance medical AI products and the training of surgeons worldwide, the SDSC collective is working on an annotated medical database for AI.
2. Machine translation errors
Machine translation systems use machine learning models to translate texts. If the training data contains errors or incorrect translations, the machine translation results may be inaccurate, leading to misunderstandings and communication problems.
3. False positives in IT security
In IT security, intrusion and malicious-activity detection systems are based on ML models. If the data used to train these models contains incorrect or mislabeled examples, the result can be false positives: legitimate actions are wrongly reported as threats. These false alerts trigger unnecessary responses and waste the time of threat-monitoring teams (SOC).
4. Imperfect film recommendation systems
Imagine a movie recommendation system based on machine learning that recommends movies to users according to their past preferences. An insidious bias creeps into the model, causing users to be recommended mainly films of a specific genre, such as action, to the detriment of other genres such as comedy or drama.
The dataset used to train the model was unbalanced, with a massive over-representation of action films, while other genres were under-represented. The model thus learned to favor action films, neglecting the varied preferences of users. This example highlights the importance of balanced and representative training data to ensure accurate and relevant recommendations.
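An imbalance like this can often be spotted before training by simply counting labels. The snippet below is a minimal sketch, not a full data audit: the function names, the 50% threshold and the toy genre list are all assumptions made for illustration.

```python
from collections import Counter


def genre_shares(genres):
    """Return each genre's share of the dataset as a fraction between 0 and 1."""
    counts = Counter(genres)
    total = sum(counts.values())
    return {genre: count / total for genre, count in counts.items()}


def over_represented(genres, max_share=0.5):
    """List genres whose share exceeds max_share (a hypothetical threshold)."""
    shares = genre_shares(genres)
    return sorted(genre for genre, share in shares.items() if share > max_share)


# Toy data, invented for illustration only.
training_genres = ["action"] * 80 + ["comedy"] * 12 + ["drama"] * 8

print(genre_shares(training_genres))
print(over_represented(training_genres))  # ['action']
```

What counts as "too much" depends on the product: the threshold should reflect the mix of preferences expected among real users, not a fixed rule.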
5. Failure of a vehicle's emergency braking system
Imagine a situation where a car manufacturer implements an automated emergency braking system, designed to detect obstacles and stop the car in the event of imminent danger. This system relies on sensors, cameras and mapping data to function properly.
In initial road tests, the emergency braking system fails to react appropriately to pedestrians and obstacles. In some cases, it brakes abruptly for no reason, while in others it fails to react at all to moving objects. These malfunctions are due to erroneous sensor data and inconsistencies in the mapping data used to train the system's model.
It turns out that the data collected for training the emergency braking model was incomplete and inaccurate. The test scenarios had not covered enough real-world situations, resulting in a system that was ill-prepared to react correctly in emergency situations.
This example underlines that, even in a sector like the automotive industry, where safety is paramount, the quality of the data used to train autonomous systems is crucial. Incorrect or incomplete data can endanger the lives of drivers, passengers and pedestrians, highlighting the importance of rigorous data collection and validation to ensure the reliability of autonomous driving systems.
To mitigate the impact of inaccurate data, it's essential to carefully validate data before using it. Annotators need to be trained in the task, in the annotation software (LabelBox, Encord, V7 Labs, Label Studio, CVAT, etc.) and in the required level of accuracy. Clear guidelines and examples of annotated data help ensure data consistency and accuracy.
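One common way to check annotation consistency is to have two annotators label the same sample and measure how often they agree beyond chance, for instance with Cohen's kappa. The sketch below is a plain-Python illustration under that assumption; the article does not prescribe this metric, and the function name and toy labels are invented.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty set of items.")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if both annotators labeled at random
    # while keeping their own label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one single, identical label
    return (observed - expected) / (1 - expected)


# Toy labels, invented for illustration only.
annotator_a = ["cat", "cat", "dog", "dog", "cat", "dog", "cat", "dog"]
annotator_b = ["cat", "cat", "dog", "cat", "cat", "dog", "dog", "dog"]

print(cohens_kappa(annotator_a, annotator_b))  # 0.5
```

A value close to 1 suggests the guidelines are being applied consistently, while a low value usually points to ambiguous instructions or insufficient training rather than to careless annotators.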
The trap of unrepresentative data
Non-representative data can distort ML models. Numerous examples in the field of facial recognition have hit the headlines. Facial recognition systems are increasingly used for authentication, security and other applications, yet several of them have shown patterns of racial and ethnic bias due to unbalanced training data.
Take the case of a facial recognition system used by law enforcement agencies to identify suspects. If the training data consists mainly of faces from a single ethnic group, the system may have difficulty correctly identifying faces from other ethnic groups. This can lead to misidentification, unfair arrests and the perpetuation of discriminatory stereotypes.
This example highlights the need for diverse and representative training data to ensure that facial recognition systems do not favor one ethnic group over another, and to avoid the damaging consequences associated with discrimination and biased justice. What's more, depending on the use case, this data will benefit from being prepared by groups of annotators with different profiles.
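A simple way to surface this kind of bias is to report accuracy separately for each group in an evaluation set instead of a single global score. The following is a minimal sketch under stated assumptions: the record format, the group names and the toy results are hypothetical and do not come from any real system.

```python
from collections import defaultdict


def accuracy_by_group(records):
    """Compute accuracy per group from (group, predicted_id, true_id) tuples."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, predicted, actual in records:
        total[group] += 1
        correct[group] += int(predicted == actual)
    return {group: correct[group] / total[group] for group in total}


# Toy evaluation results, invented for illustration only.
records = (
    [("group_a", "id_1", "id_1")] * 9
    + [("group_a", "id_2", "id_1")] * 1
    + [("group_b", "id_1", "id_1")] * 6
    + [("group_b", "id_2", "id_1")] * 4
)

scores = accuracy_by_group(records)
gap = max(scores.values()) - min(scores.values())
print(scores)
print(f"largest accuracy gap between groups: {gap:.2f}")
```

A large gap between groups is a signal to revisit how the training data was collected and annotated before the system is deployed.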
In conclusion...
Data quality is an essential pillar in the success of your AI projects. Errors in annotation, biased data and missing information can jeopardize the reliability of ML models. By following best practices such as training image, video and text annotators, validating data and monitoring it continuously, Data Scientists and other AI developers can maximize the value of their ML initiatives and avoid many of the pitfalls associated with data preparation.