Manual data annotation strategy in AI: still valid in 2024?
Data annotation: is it necessary for my AI development project and what strategy should I adopt?
β
Introduction
The quality of training data plays a key role in the development of accurate, efficient and reliable AI algorithms, underlining the importance of professional data annotation teams to the success of high-performance AI initiatives.
β
When undertaking an AI project based on unstructured data, keep data annotation in mind as part of the AI development cycle. This article aims to serve as a comprehensive guide to help you set up your data annotation strategy for AI development. Although this step is not always required, it plays a decisive role in understanding and exploiting data to build high-performance products.
β
We'll say it again and again in this article: machine learning, a fundamental aspect of modern AI systems, relies heavily on data annotation. This practice enables machines to improve their results by imitating human cognitive processes without direct intervention. It is therefore important to understand this process and, above all, the issues involved.
β
β
Reminder: data annotation in a nutshell
β
Define the different types of data annotation
The term "data annotation" encompasses a variety of methods used to enrich data in formats such as image, text, audio or video. It involves enriching structured or, more frequently, unstructured data with metadata, to facilitate interpretation by artificial intelligence algorithms.
β
Below, we explore each category in more detail.
β
Image annotation
Image annotation enables artificial intelligence (AI) models to instantly and accurately distinguish various visual elements, such as eyes, nose and eyelashes, when analyzing an individual's photo. This precision is necessary for applications such as facial filters or facial recognition, which adapt to the shape of the face and the distance from the camera. Annotations can include captions or labels, helping algorithms to recognize and understand images for autonomous learning. The main types of image annotation include classification, object recognition, and segmentation.
β
Audio annotation
Audio annotation deals with dynamic files and must take into account various parameters such as language, speaker demographics, dialects, and emotions. Techniques such as time-stamping and audio tagging are essential, including the annotation of non-verbal features such as silences and background noise.
β
Video annotation
It may seem silly to mention it, but unlike a still image, a video consists of a series of images that simulate movement. Video annotation includes the addition of key points, polygons, and frames to mark various objects across successive images. This approach enables AI models to learn the movement and behavior of objects, essential for functions such as object localization and tracking.
β
Video annotation tasks call on specific techniques such as interpolation. Interpolation simplifies and speeds up video annotation, particularly when tracking moving objects over several frames: the annotator marks an object on selected key frames, and the tool estimates its position on the frames in between.
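To make the idea concrete, here is a minimal sketch of linear interpolation between two annotated key frames. The `Box` structure, the function name and the pixel values are assumptions chosen for illustration; real annotation tools may use other interpolation schemes.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Axis-aligned bounding box in pixels (hypothetical structure)."""
    x: float
    y: float
    width: float
    height: float


def interpolate_boxes(start_frame: int, start_box: Box,
                      end_frame: int, end_box: Box) -> dict[int, Box]:
    """Linearly interpolate one box per frame between two key frames."""
    if end_frame <= start_frame:
        raise ValueError("end_frame must be greater than start_frame")
    span = end_frame - start_frame
    boxes = {}
    for frame in range(start_frame, end_frame + 1):
        # t is 0.0 on the first key frame and 1.0 on the last one.
        t = (frame - start_frame) / span
        boxes[frame] = Box(
            x=start_box.x + t * (end_box.x - start_box.x),
            y=start_box.y + t * (end_box.y - start_box.y),
            width=start_box.width + t * (end_box.width - start_box.width),
            height=start_box.height + t * (end_box.height - start_box.height),
        )
    return boxes


# The annotator draws the box on frames 0 and 10 only;
# frames 1 to 9 are filled in automatically.
track = interpolate_boxes(0, Box(100, 50, 40, 80), 10, Box(200, 70, 40, 80))
print(track[5])
```

In practice the annotator still reviews the interpolated frames and adds a key frame wherever the object changes direction or speed.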
β
Text annotation
Textual data is omnipresent, from customer comments to mentions on social networks. Text annotation requires an understanding of context, the meaning of words, and the relationship between certain phrases.
β
Annotation tasks such as semantic annotation, intent annotation and sentiment annotation enable AI models to navigate the complexity of human language, including sarcasm and humor. Other processes include named entity recognition and linking, which identifies and links textual elements to specific entities, and text categorization, which classifies text according to different topics or sentiments.
β
β
Why use data annotation tasks?
Data annotation tasks are essential because machine learning depends on the accuracy and authenticity of annotated datasets. Annotation is a step not to be neglected when preparing the datasets used to train artificial intelligence.
β
β
In this article, we explore the need for an industrial annotation phase in your artificial intelligence development cycles. We'll be looking at the strategies to be adopted (whether manual or automated annotation, or automated and enriched by manual validations).
β
β
β
Which data? Structured, semi-structured or unstructured?
β
β
Understanding the nature of data
When working on your annotation strategy for AI, the first step is to understand the nature of the data to be analyzed. This may be text, video, or images from sectors such as healthcare (medical images), retail (product images) and industry (images of manufacturing processes).
β
The nature of the data (structured or unstructured) and the total volume of data are decisive factors. Is it necessary to annotate, and if so, what approach should be adopted? Manual data annotation plays a critical role in industries such as healthcare for the annotation of medical images, since it is the only way to obtain reliable, unbiased datasets for training object detection models, for example.
β
β
Is it really necessary to label data?
Data labeling, or the act of annotating and marking data to make it recognizable and intelligible to machines, encompasses processes such as cleaning, transcription, the labeling itself, and quality assurance.
β
This step, critical in the training process of machine learning and artificial intelligence models, enables AI models to train themselves to solve real-world challenges without human intervention.
β
It is essential to understand the differences between manual and automatic annotation before developing an AI product.
β
β
Manual or automatic data annotation: what are the differences?
β
β
What about manual annotation?
Manual annotation involves the assignment of labels to documents or subsets of documents by human participants (annotators, also known as Data Labelers). This critical task in the AI development process ensures machine recognition of data for prediction and machine learning applications.
β
Automating data annotation with LLMs: a reality?
Automatic annotation hands this task to computer programs, and is used across a wide range of AI applications such as autonomous driving. Recently, many companies have been talking about the possibility of annotating data with LLMs. What's the latest?
β
In reality, data annotation tasks can be automated using a variety of methods, including techniques based on a set of rules, or supervised learning algorithms used for annotation (and therefore, whose purpose is not to be a product for the end user, but rather an AI used to prepare data for other AIs). The latter supervised learning algorithms require a prior data annotation phase, no matter what anyone says.
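As an illustration of the rule-based family of methods, here is a minimal sketch that labels spans of text with regular expressions. The label names, the patterns and the sample ad are hypothetical; a real project would maintain a much larger and carefully tested rule set.

```python
import re

# Hypothetical rules: each label is paired with a regular expression.
RULES = {
    "PRICE": re.compile(r"\d[\d,]*\s?(?:EUR|USD)"),
    "SURFACE": re.compile(r"\d+\s?(?:m2|sqm)"),
}


def annotate(text: str) -> list[dict]:
    """Return every span matched by a rule, ordered by position in the text."""
    annotations = []
    for label, pattern in RULES.items():
        for match in pattern.finditer(text):
            annotations.append({
                "label": label,
                "start": match.start(),
                "end": match.end(),
                "text": match.group(),
            })
    return sorted(annotations, key=lambda a: a["start"])


ad = "Bright flat, 75 sqm, asking 250,000 EUR."
spans = annotate(ad)
for span in spans:
    print(span["label"], span["text"])
```

Rules like these are fast and transparent, but they only catch the patterns someone thought to write down, which is one reason their output is usually checked by annotators.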
β
How do I choose between manual and automatic annotation?
The choice between manual and automatic annotation depends largely on the characteristics of the project. Keep your end goal in mind: if you are looking to build a "ground truth" dataset, it's unlikely that automatic annotation, which is often not very precise, will meet your needs. However, while manual annotation often offers unrivalled accuracy, it can be costly and time-consuming.
β
It is also possible to opt for a hybrid approach, combining the advantages of both methods to maximize efficiency while preserving annotation quality. We can't stress this enough: understanding the needs of your use case and the expected level of data quality are the main criteria for choosing the annotation method best suited to training your AI.
β
Don't be fooled by the promises of 100% automatic annotation
β
β
Promises, promises, promises
The promise of 100% automatic annotation is seductive, not least because of the speed, lower costs and ability to automate large volumes of data. However, it's important not to be fooled by the idea that automated annotation can completely replace human intervention, especially in cases where data accuracy and contextualization are essential.
β
Large language models, such as OpenAI's GPT-4, offer promising capabilities for automatic annotation, processing large amounts of textual data quickly and cost-effectively. They can be used for annotation tasks in the social sciences, where they have shown an ability to reproduce, with reasonable accuracy, annotations already produced by humans. However, the performance of these models varies and is often stronger in recall than in precision: they tend to find most of the positive cases, but at the cost of more false positives.
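The difference between recall and precision can be checked by comparing automatic labels with a human-labeled reference. The sketch below uses made-up binary labels purely for illustration; the function name and the data are assumptions, not results from any study.

```python
def precision_recall(human: list[bool], model: list[bool]) -> tuple[float, float]:
    """Compare model labels with human reference labels for one positive class."""
    true_pos = sum(h and m for h, m in zip(human, model))
    false_pos = sum((not h) and m for h, m in zip(human, model))
    false_neg = sum(h and (not m) for h, m in zip(human, model))
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall


# Illustrative data: the model finds all 3 positives but also flags 2 negatives.
human_labels = [True, True, True, False, False, False, False, False]
model_labels = [True, True, True, True, True, False, False, False]

precision, recall = precision_recall(human_labels, model_labels)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here recall is perfect while precision is only 3 out of 5: every positive case is found, but two of the five items flagged by the model are wrong and would need to be corrected by an annotator.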
β
β
Tools to optimize manual annotation processes
Annotation platforms such as CVAT, on the other hand, offer automated annotation features for computer vision tasks, enabling greater scale and accuracy in specific projects. They support bounding box annotation, object detection, image segmentation and more, with a degree of task-based automation to help process larger volumes of data. While this makes annotators' work easier, it doesn't make their intervention any less important: when we associate these features with automation, we're really talking about making manual tasks more efficient, not automating a workflow 100%!
β
Other platforms, such as Argilla, are designed to facilitate data annotation, dataset management and model monitoring in the development of machine learning systems. This platform enables users to build and refine datasets with an intuitive interface that supports a variety of annotation types, such as text labels and image annotations. While there's no question of automation per se, platforms like Argilla pave the way for a hybrid approach to data annotation for AI...
β
β
A hybrid approach: the key to success?
Hybrid approaches, combining manual and automatic annotation, can also be implemented, improving accuracy while reducing the time and costs associated with annotating large datasets.
β
These approaches take advantage of AI to pre-annotate data that annotators can then check and adjust if necessary. A hybrid approach achieves high-quality annotations by exploiting both the efficiency of automation and the finesse of human analysis.
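One possible way to organize this check-and-adjust step is to order pre-annotations so that annotators see the least confident ones first. The sketch below assumes the model returns a confidence score with each label; the `PreAnnotation` structure, the 0.8 threshold and the sample items are hypothetical.

```python
from typing import NamedTuple


class PreAnnotation(NamedTuple):
    item_id: str
    label: str         # label proposed by the model
    confidence: float  # model confidence between 0 and 1


def build_review_queue(pre_annotations, threshold=0.8):
    """Order pre-annotations for human review, least confident first.

    Every item is still reviewed; the flag only marks items below the
    (assumed) confidence threshold for closer inspection.
    """
    ordered = sorted(pre_annotations, key=lambda p: p.confidence)
    return [(p, p.confidence < threshold) for p in ordered]


batch = [
    PreAnnotation("img_1", "cat", 0.97),
    PreAnnotation("img_2", "dog", 0.55),
    PreAnnotation("img_3", "cat", 0.82),
]
queue = build_review_queue(batch)
for item, needs_close_review in queue:
    print(item.item_id, item.label, item.confidence, needs_close_review)
```

The threshold is a project-level choice: it should be set by measuring how often annotators actually correct pre-annotations at each confidence level.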
β
The integration of these advanced automatic and semi-automatic annotation tools is essential for Machine Learning and computer vision projects in particular, enabling companies and researchers to develop more robust and accurate models.
β
β
Challenges ahead
However, challenges remain, particularly in terms of maintaining accuracy as data structures evolve, requiring ongoing adjustments to models to take account of new information introduced or to be introduced. Manual annotation remains essential for providing accurate references and for the validation of automatic annotations, especially in fields where nuance and context are important.
β
β
Although automatic annotation tools offer significant advantages in terms of speed and cost, they should not be considered as a complete solution without human supervision. The integration of human checks and the strategic use of automatic annotation as part of a wider annotation workflow is essential to maintain the quality and reliability of annotated data.
β
β
Enhancing manual annotation with artificial intelligence (AI): when is it relevant?
β
β
When to use manual vs. automatic annotation?
The appropriateness of using AI methods to structure data depends closely on the volume of data to be processed. For example, when analyzing responses to a questionnaire with a relatively modest volume of data, it may make more sense to opt for a manual annotation approach.
β
This method, although time-consuming, can precisely meet the objective of analyzing the themes raised by survey respondents, for example. Note that deciding whether a volume of data justifies developing an AI does not rest solely on a fixed threshold for the number of documents, but rather on criteria such as the nature and length of the documents and the complexity of the annotation task.
β
Machine learning can be applied to improve manual annotation, enabling systems to learn from each annotation task to become more accurate and efficient. Integrating AI into data annotation processes significantly improves the efficiency and accuracy of manual annotation, underlining its importance in the development of accurate and efficient AI and machine learning models.
β
However, when faced with a large volume of documents or a continuous flow of data, automation of the annotation process generally becomes a relevant option. In these situations, the annotation phase aims to annotate only a portion of the documents initially, depending on the nature of the documents and the complexity of the task.
β
Partial annotation of the data can be used to train a supervised algorithm, enabling efficient automation of annotation across the entire corpus. Be careful, however, not to imagine that the automatic annotation task is self-sufficient. Generally speaking, it will produce pre-labeled data that needs to be qualified by annotators to be exploitable by an AI model.
β
β
How to implement AI technologies in annotation cycles?
The implementation of AI technologies in data annotation projects matters insofar as it contributes to the quality of training data and the performance of AI and machine learning models. The annotation task becomes more focused for annotators, making their work more efficient. Speech recognition is a good example of how AI-enhanced annotation can handle various types of data, including those derived from natural language, to help understand and classify information reliably.
β
A frequently recommended approach is to use Active Learning in annotation processes, to improve the working conditions and efficiency of annotators. Active Learning consists in intelligently selecting the most informative examples for the algorithm, in order to progressively improve its performance.
β
By integrating Active Learning into the manual annotation process, we can optimize the process by specifically targeting the most complex or ambiguous data, helping to increase the efficiency and accuracy of the algorithm over time.
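A common way to pick the most complex or ambiguous examples is uncertainty sampling: rank unlabeled items by how unsure the current model is and send the top of the list to annotators. The sketch below is one simple variant, using entropy over predicted class probabilities; the item identifiers and probabilities are invented for illustration.

```python
import math


def entropy(probabilities: list[float]) -> float:
    """Shannon entropy of a predicted class distribution (higher = less certain)."""
    return -sum(p * math.log(p) for p in probabilities if p > 0)


def select_for_annotation(predictions: dict[str, list[float]],
                          batch_size: int) -> list[str]:
    """Return the ids of the items the model is least certain about."""
    ranked = sorted(predictions,
                    key=lambda item_id: entropy(predictions[item_id]),
                    reverse=True)
    return ranked[:batch_size]


# Hypothetical model outputs for four unlabeled ads (three classes each).
predictions = {
    "ad_001": [0.98, 0.01, 0.01],
    "ad_002": [0.40, 0.35, 0.25],
    "ad_003": [0.70, 0.20, 0.10],
    "ad_004": [0.34, 0.33, 0.33],
}
next_batch = select_for_annotation(predictions, batch_size=2)
print(next_batch)
```

After annotators label the selected batch, the model is retrained and the ranking is recomputed, so each round of manual work goes where it helps the model most.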
β
Take, for example, a real estate ad annotation task (30 to 40 labels on average for each 500-word ad). By integrating Active Learning after annotating 2,000 texts, pre-annotated data will be generated. This pre-annotated data will then be submitted to the annotators for manual qualification, i.e. they will have the task of checking and correcting pre-annotation errors, rather than manually annotating the 30 to 40 labels mentioned above, for the remaining 5,000 ads, for example.
β
β
What tools can I use to make my manual data annotation processes more efficient?
β
β
1. Collaborative annotation platforms
β
Introduction to collaboration and project management
For manual data annotation projects, efficiency can be greatly improved through the use of collaborative platforms that allow multiple annotators to work simultaneously on the same dataset. Tools such as LabelBox offer features that facilitate task allocation and real-time progress monitoring.
β
Key features and benefits
These platforms often integrate project management functions, enabling supervisors to track progress, assign specific tasks and monitor annotation quality on an ongoing basis. The user interface of these tools is designed to minimize human error and maximize productivity through keyboard shortcuts, customizable mark-up templates, and simplified review options.
β
β
2. Using Artificial Intelligence to assist manual annotation
β
AI assistance techniques
Integrating AI into manual annotation processes can considerably speed up work while maintaining high accuracy. For example, tools such as Snorkel AI use weak supervision approaches to automatically generate preliminary annotations that annotators can then review and refine.
β
Advantages of the hybrid approach
A hybrid method using both manual annotation and automated workflows not only reduces the time spent annotating each piece of data, but also improves the consistency of annotated data by proposing initial labels based on advanced machine learning algorithms.
β
β
β
3. Revision and quality control systems
β
Importance of quality control
Quality control is essential in any data annotation process to ensure the reliability and usefulness of annotated data. Integrating review systems where annotations are regularly checked and validated by other team members or supervisors can help maintain the high quality standards needed for model training.
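One way to quantify this kind of cross-checking is to have two annotators label the same sample and measure their agreement. The sketch below computes Cohen's kappa, a widely used agreement statistic that corrects for chance; choosing this metric is an assumption on our part, and the labels are invented for illustration.

```python
from collections import Counter


def cohen_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Agreement between two annotators, corrected for chance agreement."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty sample")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Probability that both annotators pick the same label by chance.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


annotator_1 = ["pos", "pos", "neg", "neg", "pos", "neg", "pos", "neg", "pos", "neg"]
annotator_2 = ["pos", "pos", "neg", "pos", "pos", "neg", "neg", "neg", "pos", "neg"]

kappa = cohen_kappa(annotator_1, annotator_2)
print(f"kappa={kappa:.2f}")
```

A low score on a sample is usually a sign that the annotation guidelines are ambiguous for some cases and need to be clarified before the campaign continues.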
β
Revision tools and methods
Features like built-in comments, change histories, and alerts for inconsistencies are key elements that platforms like Prodigy and LightTag offer to facilitate text annotation processes, for example. These tools also produce detailed metrics on annotator performance, helping to identify training or continuous improvement needs.
β
β
β
4. Training and ongoing support for annotators
β
The role of training
Ongoing training for annotators plays an important role in improving the quality of annotated data. Offering regular training sessions and learning resources for annotators can help align their understanding of annotation criteria and increase their efficiency. We can't stress this enough: before hiring a data-labeling provider, think about formalizing an annotation manual!
β
Using online resources and tutorials
Platforms such as Coursera and Udemy offer specific courses on data annotation that can be useful. In addition, video tutorials and step-by-step guides available on these annotation platforms can also be valuable resources.
β
β
β
The importance of ethical responsibilities in Data Labeling
β
Guaranteeing fair and equitable practices
It is important to consider one's ethical responsibilities when it comes to Data Labeling to ensure fair and equitable practices in the development of AI models. Ensuring an ethical data annotation process means putting in place safe, sustainable and fair employment practices for those carrying out this work, taking care to offer them dignified working conditions and fair remuneration. Annotation work is often seen as a laborious and degrading task: we believe it is a vector for job creation and development in countries where opportunities are sometimes few and far between.
β
Furthermore, diversity and inclusion must be at the heart of annotation practices to avoid the introduction of biases that could negatively affect the fairness and representativeness of AI models. This means integrating diverse perspectives and maintaining an inclusive environment among data annotation teams, so that all cultures and individuals affected by AI models are fairly represented.
β
β
Detecting and reducing model bias
In addition, it is essential to adopt proactive measures to detect and reduce bias from the earliest stages of data collection and processing. This includes employing pre-processing techniques to balance datasets, and using post-processing methods to adjust models to minimize persistent biases.
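As an example of a pre-processing technique for balancing a dataset, the sketch below applies random oversampling: examples from under-represented labels are duplicated until every label has as many examples as the largest one. This is only one simple option among several, and the function name and sample data are hypothetical.

```python
import random
from collections import Counter


def oversample(examples: list[tuple[str, str]], seed: int = 0) -> list[tuple[str, str]]:
    """Duplicate minority-label examples until all labels are equally frequent."""
    if not examples:
        return []
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    by_label: dict[str, list[tuple[str, str]]] = {}
    for text, label in examples:
        by_label.setdefault(label, []).append((text, label))
    target = max(len(items) for items in by_label.values())
    balanced = []
    for items in by_label.values():
        balanced.extend(items)
        # Draw extra copies at random to reach the size of the largest label.
        balanced.extend(rng.choices(items, k=target - len(items)))
    return balanced


dataset = [
    ("file A", "approved"), ("file B", "approved"),
    ("file C", "approved"), ("file D", "approved"),
    ("file E", "rejected"),
]
balanced = oversample(dataset)
print(Counter(label for _, label in balanced))
```

Duplicating examples does not add new information, so when a label is badly under-represented it is generally better to collect and annotate more real examples for it.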
β
For these efforts to be effective, it is advisable to set up an ongoing evaluation and feedback system, enabling the accuracy and precision of annotations to be regularly monitored and improved. Regular data audits can be beneficial, offering an independent perspective on annotation practices and helping to maintain greater accountability and transparency.
β
β
In short, adopting these ethical practices in data annotation is not only a legal or moral necessity, but also an essential component in the development of fair and reliable AI technologies.
β
β
Recognizing the true value of Data Labeling work
Finally, it is essential to recognize that for many Data Labelers around the world, artificial intelligence offers significant opportunities for professional and economic development.
β
In many countries (such as Madagascar), jobs in the field of Data Labeling provide a stable source of income and enable individuals to acquire valuable technical skills in a fast-growing sector. These opportunities can be particularly valuable in regions where traditional employment options are limited or declining.
β
Companies employing Data Labelers therefore have a responsibility to maximize these opportunities by providing not only fair and safe working conditions, but also training and opportunities for advancement.
β
In so doing, they contribute not only to improving the living conditions of their employees, but also to promoting local economic development. This creates a virtuous circle where technological advances benefit not only companies, but also the communities that support these technologies through their daily work.
β
β
Conclusion
β
The balance between manual and automatic annotation can be adjusted to the specific requirements of each data annotation campaign and artificial intelligence project. A dynamic approach that evolves over time is essential.
β
In this context, Innovatiana stands out by offering a complete solution through its services and its "CUBE" platform, accessible at https://dashboard.innovatiana.com. This platform provides access to labeled data on demand, to meet the varied needs of projects, while offering the possibility of reinforcing labeling teams by mobilizing our team of Data Labelers.
β
Innovatiana is thus fully in line with a dynamic and progressive vision of annotation within artificial intelligence projects, offering a complete and adapted response to current challenges. Selecting a company specialized in data annotation, or "tagging", is important for the success of AI projects. It's up to you to select the right partner to build your datasets and obtain accurate, reliable AI models!