Running a data annotation campaign: the guide (1/2)
Why annotate images, videos, texts... and why does it matter for AI?
To analyze the content of your data, train supervised algorithms, and make your artificial intelligence project a success, "structured" or "annotated" data is essential.
If your data is already structured, it has been organized so that it can be represented in tabular form, with rows corresponding to observations and columns corresponding to variables. Structuring data upstream saves significant time, and in that case you will most likely not need an annotation phase at all.
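To make this concrete, here is a minimal sketch of what "structured" means in practice, using pandas (the table contents are purely illustrative):

```python
import pandas as pd

# Each row is an observation, each column a variable:
# this hypothetical real-estate table is already "structured".
ads = pd.DataFrame(
    {
        "city": ["Lyon", "Paris", "Lille"],
        "housing_type": ["apartment", "house", "apartment"],
        "purchase_price": [250_000, 780_000, 180_000],
    }
)

print(ads.dtypes)  # variables with explicit types
print(ads.head())  # observations as rows
```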
If, on the other hand, your data is "unstructured", i.e. it cannot be described by a predefined model, is not categorized, and can be very diverse (images, text, video, etc.), artificial intelligence algorithms will find it much harder to exploit. In that case, you will almost certainly need to organize an annotation phase.
The annotation phase involves assigning one or more labels to elements in a dataset, thereby creating a structured dataset that can be used to train supervised algorithms. In practice, annotation means assigning the most appropriate label to each piece of data: for example, labeling a collection of animal photographs as "dog" or "cat", or tagging a series of real estate advertisements with labels such as "city", "type of housing" or "purchase price".
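The result of annotation can be stored very simply. The sketch below shows one hypothetical way to pair raw documents with their labels; the file names and labels are illustrative, not a prescribed format:

```python
# A minimal labeled dataset: each item pairs a raw document
# (here, an image file path) with the label an annotator assigned.
annotations = [
    {"image": "photos/0001.jpg", "label": "dog"},
    {"image": "photos/0002.jpg", "label": "cat"},
    {"image": "photos/0003.jpg", "label": "dog"},
]

# Once every item carries a label, the collection is structured
# enough to train a supervised classifier.
labels = sorted({item["label"] for item in annotations})
print(labels)  # ['cat', 'dog']
```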
The relevance and performance of your AI solution will be strongly influenced by the quality of your data. Label accuracy is a key aspect of that quality, alongside others such as the completeness of explanatory variables or the detection of outliers. The annotation phase must therefore be carried out with particular care to obtain high-quality labels. This guide presents the key steps and some best practices for achieving that goal.
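One common way to quantify label quality is inter-annotator agreement: have several annotators label the same documents and measure how often they agree beyond chance. Below is a minimal sketch using scikit-learn's cohen_kappa_score on illustrative data:

```python
from sklearn.metrics import cohen_kappa_score

# Labels assigned by two annotators to the same ten documents
# (illustrative values).
annotator_a = ["dog", "cat", "dog", "dog", "cat", "dog", "cat", "cat", "dog", "dog"]
annotator_b = ["dog", "cat", "dog", "cat", "cat", "dog", "cat", "dog", "dog", "dog"]

# Cohen's kappa corrects raw agreement for chance;
# values close to 1.0 indicate consistent, high-quality labels.
kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Inter-annotator agreement (kappa): {kappa:.2f}")
```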
How do you prepare your data annotation campaign? Start by identifying your stakeholders
Running a text, image, or video annotation campaign requires a specialized team: annotators (or Data Labelers), a project manager, a Data Scientist, and possibly an administrator for the annotation platform (a labeling solution such as Label Studio or CVAT).
Below is a brief overview of the different profiles involved in annotation campaigns for AI:
The project manager (business expert)
The project manager, a business expert, plays an essential role in planning and monitoring the annotation process. Their responsibilities include designing the annotation schema and the accompanying manual, training annotators, estimating the time needed for the various annotation tasks, drawing up an annotation plan, and monitoring the quality and quantity of the annotations produced.
The Data Scientist (technical expert)
The Data Scientist implements tools and methods to evaluate the progress and quality of annotations, for the needs of an AI model. They can also pre-annotate documents, prioritize annotations, and implement technical methods to speed up the annotation process. Upstream of annotation, the Data Scientist can define a data curation strategy, first working on the raw data to eliminate noise (e.g. unreadable frames in a set of videos).
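As an illustration of such a curation step, the sketch below filters out blurry or corrupted video frames using OpenCV's variance-of-Laplacian sharpness measure; the frames/ directory and the threshold value are assumptions to adapt to your own data:

```python
import cv2  # OpenCV
from pathlib import Path

BLUR_THRESHOLD = 100.0  # empirical cutoff; tune per dataset

def is_readable(frame_path: Path) -> bool:
    """Keep frames sharp enough to annotate; drop blurry or corrupted ones."""
    image = cv2.imread(str(frame_path), cv2.IMREAD_GRAYSCALE)
    if image is None:  # unreadable or corrupted file
        return False
    sharpness = cv2.Laplacian(image, cv2.CV_64F).var()
    return sharpness >= BLUR_THRESHOLD

frames = sorted(Path("frames").glob("*.jpg"))  # hypothetical directory
curated = [f for f in frames if is_readable(f)]
print(f"Kept {len(curated)} of {len(frames)} frames")
```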
The annotation platform administrator
The platform administrator installs the annotation software, manages user accounts, makes documents available, prepares labeling environments, and regularly backs up annotations to avoid any loss of data. He or she also ensures the suitability of the solution and carries out all the technical tests required to exploit the data and metadata produced (e.g., can complete data be extracted in JSON format with an acceptable level of performance?).
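Such a technical test might look like the sketch below, which sanity-checks a hypothetical JSON export (loosely modeled on the task-with-annotations structure that tools like Label Studio produce) and writes a timestamped backup copy; the paths and export structure are assumptions:

```python
import json
from datetime import datetime
from pathlib import Path

EXPORT_FILE = Path("exports/annotations.json")  # hypothetical export path

# Load the export and run basic sanity checks before archiving it.
with EXPORT_FILE.open(encoding="utf-8") as f:
    tasks = json.load(f)  # assumed: a list of tasks, each with an "annotations" list

missing = [t for t in tasks if not t.get("annotations")]
print(f"{len(tasks)} tasks exported, {len(missing)} without annotations")

# Timestamped backup copy so no labeling work is lost.
stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
backup = EXPORT_FILE.with_name(f"annotations-{stamp}.json")
backup.write_bytes(EXPORT_FILE.read_bytes())
```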
Data annotators
The profile of annotators varies according to the annotation task. Some cases simply require a command of a language such as English or French, while others call for specific expertise (e.g. knowledge of anatomy or of a particular sport). Annotators are tasked with understanding the task, annotating the documents, and reporting any questions or difficulties to the campaign manager as the annotation progresses.
Defining the problem
The annotation process, often a preliminary phase of a larger-scale AI project, requires in-depth reflection on the problem the project addresses before it actually begins. This precaution ensures that the annotations produced contribute effectively to solving that problem.
The annotation process may vary depending on the application and the nature of the problem. Consequently, it is imperative to answer a series of essential questions:
- What problem does the project aim to solve?
- What is the overall context of the project and what public service mission does it support?
- What are the project's strategic objectives, and how do they align with the organization's goals?
- What are the project's operational objectives?
- What impact is the solution expected to have on service organization, both from the point of view of civil servants and users?
- Are there similar projects that would be worth exploring?
- What is the scope of the proposed solution, and how does this affect the scope of the data to be annotated?
Creating a data annotation schema
The annotation schema is the model used to describe the annotations in your project, and it must be derived from the problem defined above. In concrete terms, it consists, at a minimum, of a set of labels (i.e., terms used to characterize a given piece of information in a document) together with a precise definition of each label. For some projects, the schema may also include a hierarchy of labels, in which labels are linked by parent-child relationships, or relationships between terms. The schema is sometimes complemented by a task that identifies relationships between annotated entities (for example, linking a pronoun to the noun it refers to). A sketch of a simple schema follows.
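A schema can be captured in a small, explicit data structure from the outset, which makes it easy to version, review, and share with annotators. Below is a hypothetical sketch in Python; the labels and definitions are illustrative only:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Label:
    """One schema entry: a name, a precise definition,
    and an optional parent for hierarchical schemas."""
    name: str
    definition: str
    parent: Optional[str] = None

# Hypothetical schema for a real-estate annotation project.
SCHEMA = [
    Label("housing", "Any mention of a dwelling."),
    Label("apartment", "A unit within a shared building.", parent="housing"),
    Label("house", "A standalone dwelling.", parent="housing"),
    Label("purchase_price", "The asking price stated in the advertisement."),
]

for label in SCHEMA:
    prefix = f"{label.parent} > " if label.parent else ""
    print(f"{prefix}{label.name}: {label.definition}")
```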
The business problem the project addresses is often complex, with many special cases or exceptions to the usual rules. Establishing an annotation schema therefore often involves simplification, which inevitably entails some loss of information or precision. It is important, however, not to oversimplify, and to strike the right balance between simplicity and relevance to the business problem. An iterative process is generally the best way to reach that balance. If the purpose of annotation is to train an artificial intelligence algorithm, do not exclude a priori any specific cases or instructions simply because they seem too difficult for an automatic solution to reproduce.
Develop and update annotation campaign documentation
Documentation is a fundamental element and must evolve throughout the annotation campaign. By methodically recording the milestones reached and the challenges encountered, it becomes a valuable tool for keeping information consistent within the project team. It also helps share lessons learned with other, similar projects.
Various types of documentation, each targeting specific functions within the project, are essential: general documentation, documentation for annotators, and documentation specifically designed for the annotation platform administrator.
Guide for annotators
Documentation for annotators is of paramount importance as a training aid. It should include a detailed description of the project to give annotators a clear vision of the intended application, the annotation schema (hierarchical where appropriate), and precise explanations of the various labels, including the methodological choices and the logic underlying the annotation. Instructions for using the annotation software, concrete examples of tricky cases, and a Questions & Answers section all help facilitate the annotation process.
Annotation platform administrator's guide (V7 Labs, Encord or CVAT)
Documenting how the annotation platform works is just as important. A specific guide for the platform administrator should explain how to create annotator accounts, upload documents, assign tasks, monitor progress, correct annotations, and export annotated documents. This documentation ensures smooth, efficient management of the platform throughout the annotation campaign.
(Continued guide available at this address).
Innovatiana stands out by offering a complete solution through its "CUBE" platform, accessible at https://dashboard.innovatiana.com. The platform brings data collection and annotation together in a single environment: by centralizing everything these processes require, it positions itself as a one-stop solution for artificial intelligence projects, tailored to the specific needs of each one. It also offers the flexibility needed to reinforce labeling teams, fostering an efficient, collaborative approach. Innovatiana is fully in line with a dynamic, evolving view of annotation, providing a comprehensive solution that meets the current challenges of artificial intelligence projects.