Small datasets: how to maximize their use
In the fast-paced field of artificial intelligence, innovation and the quest for performance constantly take center stage. Recently, the Chinese AI company DeepSeek shook up the landscape by dethroning ChatGPT as the most downloaded free app on Apple's App Store. Since its launch at the end of 2022, ChatGPT has dominated the AI field, despite growing competition from giants such as Google, Meta and Anthropic. However, DeepSeek's meteoric rise signals a possible paradigm shift in the AI industry, as this model is attracting attention not only for its impressive performance but also for its strategic approach to data.
Founded in July 2023 by Liang Wenfeng in Hangzhou, DeepSeek has rapidly made a name for itself. Recent benchmarks show that its third-generation large language model, DeepSeek-V3, has outperformed those of major U.S. technology companies, while being developed at significantly lower cost, according to statements by its founders. This feat has sparked keen interest, along with questions about how a young start-up could achieve what seemed impossible. The answer, as Salesforce CEO Marc Benioff pointed out, lies not just in the technology itself, but in the data and metadata that feed it. Calling DeepSeek "Deepgold", Benioff said: "The real value of AI lies not in the user interface or the model. Tomorrow's fortune? It's in our data!"
This perspective highlights a growing awareness within the AI community of the importance of datasets, and small datasets in particular, in reducing reliance on costly, energy-intensive computing infrastructure. This is nothing new: several years ago, the eminent Andrew Ng was already raising this topic on his blog.
In short, while attention has long focused on model scale and computing power, the emphasis is now shifting to the quality and specificity of the data used to train these models. Small datasets, often underestimated in favor of large databases, have a unique potential to address niche applications, improve efficiency and enable AI development even in resource-constrained environments.
💡 In this article, we'll explore why small datasets are becoming a cornerstone of AI progress, how they compare with large datasets in terms of utility and impact, and what lessons can be learned from pioneers like DeepSeek (which, incidentally, didn't necessarily use small datasets, but that's another debate, since its training data was not yet known at the time of writing!). Whether you're an AI enthusiast, a Data Scientist or simply curious, understanding the role of small datasets in AI development offers valuable insights into the future of AI and its potential!
What is a Small Dataset?
In the world of massive data and artificial intelligence, we often hear about the importance of large datasets. However, small datasets play an equally important role in many fields. But what exactly do we mean by a "small dataset"?
A small dataset is generally defined as a dataset containing a relatively small number of observations or samples (i.e. little raw data, enriched with a limited amount of metadata). Although the exact definition varies with context, a dataset is generally considered "small" when it contains fewer than a few thousand entries. Such datasets may come from a variety of sources, such as scientific experiments, small-scale surveys, or data collections limited to a specific scope.
💡 It's important to note that the size of a dataset is relative to the field of application and the problem to be solved. For example, in the field of genomics, a set of 1,000 DNA sequences might be considered small, whereas in a local sociological study, the same number of participants might be considered substantial. The notion of "small dataset" therefore depends on the context and standards specific to each discipline!
The advantages of small datasets
Contrary to what you might think, small data sets have many advantages that make them valuable in many situations. Here are just a few of these advantages:
1. Ease of collection and management
Small datasets are generally faster and less costly to collect. They require fewer resources in terms of time, money and manpower, making them accessible to more people.
2. Speed of analysis
With less data to process, analyses can be carried out more quickly, enabling more frequent iterations and adjustments in the AI research and development process.
3. Better understanding of data
Smaller data sets enable deeper exploration and a finer understanding of each data point. This can lead to valuable qualitative insights that might otherwise be lost when analyzing large volumes of data.
4. Flexibility and agility
Small datasets offer greater flexibility in experimentation and hypothesis adjustment. It's easier to modify parameters or redirect the study if necessary.
5. Noise reduction
In some cases, small datasets may contain less noise and fewer errors, especially if they are carefully assembled, and are therefore of higher quality. These datasets can be used to develop more accurate and reliable models.
Challenges and limits of small datasets
Although small datasets offer many advantages, they are not without their challenges and limitations. Understanding these aspects is very important for using these datasets effectively:
1. Limited representativeness
One of the main challenges of small data sets is their limited ability to represent a larger population. This increases the risk of sampling bias, which can lead to erroneous conclusions if care is not taken.
2. Reduced statistical power
With less data, the statistical power of analyses is often reduced. This means it can be more difficult to detect subtle effects or draw statistically significant conclusions.
3. Sensitivity to outliers
Small datasets are more sensitive to outliers or measurement errors. A single erroneous data point can have a disproportionate impact on analysis results.
4. Limitations in the application of certain analysis techniques
Some advanced analysis techniques, particularly in the field of machine learning, require large volumes of data to be effective. Small data sets can limit the use of these methods.
5. Risk of overfitting
In the context of machine learning, models trained on small datasets are more likely to overfit, i.e. to fit the training data too closely, to the detriment of generalization.
Techniques for maximizing the use of small datasets
Faced with the challenges posed by small datasets, a number of techniques have been developed to make the most of them. Here are some of the approaches we frequently recommend to our customers:
1. Cross-validation
This technique is used to evaluate model performance on small data sets. It involves dividing the data into subsets, training the model on some and testing it on others, repeating the process several times. This provides a more robust estimate of model performance.
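As an illustration, here is a minimal sketch of 5-fold cross-validation with scikit-learn; the iris dataset and logistic regression model are stand-ins for your own small dataset and model:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# The iris dataset (150 samples) stands in for a small dataset of your own
X, y = load_iris(return_X_y=True)

model = LogisticRegression(max_iter=1000)

# 5 folds: every observation is used once for testing and four times for training
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv)

print("Accuracy per fold:", scores.round(3))
print(f"Mean accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")
```

Reporting the spread across folds, not just the mean, is what makes this estimate more trustworthy on a small dataset.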
2. Data augmentation
In some fields, such as image processing, we can artificially increase the size of the dataset by creating new instances from existing data, for example by rotating, cropping or slightly modifying the original images.
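To make this concrete, here is a hedged sketch using torchvision's transforms; the specific transformations and parameter values are illustrative choices, not tuned recommendations:

```python
from torchvision import transforms

# Each epoch sees a slightly different version of every original image,
# effectively multiplying the size of a small image dataset.
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),                    # mirror half the images
    transforms.RandomRotation(degrees=10),                     # small random rotations
    transforms.RandomResizedCrop(size=224, scale=(0.8, 1.0)),  # random crops
    transforms.ColorJitter(brightness=0.2, contrast=0.2),      # lighting variations
    transforms.ToTensor(),
])

# Applied on the fly when loading a training set, e.g.:
# dataset = torchvision.datasets.ImageFolder("data/train", transform=augment)
```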
3. Regularization techniques
To avoid overfitting, we often use regularization methods such as L1 regularization (Lasso) or L2 regularization (Ridge). These techniques add a penalty to the model's loss function, encouraging simplicity and reducing the risk of overfitting.
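By way of illustration, a minimal scikit-learn sketch comparing the two penalties on synthetic data; the alpha value is an assumption to be tuned, for example by cross-validation:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic regression problem: 200 samples, 50 features
X, y = make_regression(n_samples=200, n_features=50, noise=10.0, random_state=0)

# alpha controls the penalty strength (an illustrative value, to be tuned)
lasso = Lasso(alpha=1.0).fit(X, y)  # L1: drives some coefficients to exactly zero
ridge = Ridge(alpha=1.0).fit(X, y)  # L2: shrinks all coefficients toward zero

print("Non-zero Lasso coefficients:", (lasso.coef_ != 0).sum(), "/ 50")
print("Non-zero Ridge coefficients:", (ridge.coef_ != 0).sum(), "/ 50")
```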
4. Transfer learning
This approach, Transfer Learning, involves using a model pre-trained on a large dataset and fine-tuning it on our small dataset. This allows us to benefit from the knowledge gained on large volumes of data, even when our own data is limited.
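Here is a hedged sketch with a ResNet-18 pretrained on ImageNet via torchvision; the number of classes and the choice to freeze the whole backbone are assumptions to adapt to your own problem:

```python
import torch.nn as nn
from torchvision import models

num_classes = 5  # hypothetical number of classes in our small dataset

# Load a ResNet-18 pretrained on ImageNet (a large, generic dataset)
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)

# Freeze the pretrained feature extractor
for param in model.parameters():
    param.requires_grad = False

# Replace the final layer: only this layer will be trained on the small dataset
model.fc = nn.Linear(model.fc.in_features, num_classes)
```

Training then proceeds as usual, but only the new final layer's weights are updated, which drastically reduces how much data is needed.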
5. Using a classifier to enrich the dataset
Finally, a powerful strategy (which we're seeing more and more) is to use a classifier to transform a small dataset into a larger one.
Example approach:
- Select a representative subset of 5,000 well-labeled samples.
- Train a classifier on this data to create an initial model.
- Apply this classifier to a larger set of unlabeled data, in batches of 5,000 samples.
- Correct errors manually after each iteration and monitor the improvement in model accuracy.

Starting from around 70-80% accuracy, this iterative process progressively enriches the dataset while reducing errors. This approach is ideal for situations where large-scale manual data collection is difficult or costly.
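A simplified sketch of this loop is given below; the synthetic data and the review_labels placeholder (standing in for the human correction step) are assumptions for illustration only:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for real data: 5,000 labeled samples plus a large pool
X, y = make_classification(n_samples=20_000, n_features=20, random_state=0)
X_labeled, y_labeled = X[:5000], y[:5000]  # well-labeled seed set
X_pool = X[5000:]                          # unlabeled pool

def review_labels(samples, predicted):
    """Placeholder for manual correction of model predictions."""
    return predicted  # in practice, humans fix the errors here

model = LogisticRegression(max_iter=1000)
batch_size = 5000
for start in range(0, len(X_pool), batch_size):
    model.fit(X_labeled, y_labeled)             # (re)train on the current set
    batch = X_pool[start:start + batch_size]
    pred = model.predict(batch)                 # pseudo-label a new batch
    corrected = review_labels(batch, pred)      # manual review step
    X_labeled = np.vstack([X_labeled, batch])   # grow the labeled set
    y_labeled = np.concatenate([y_labeled, corrected])

print(f"Final labeled set: {len(y_labeled)} samples")
```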
Application areas for small datasets
Small datasets are useful in many areas, often where large-scale data collection is difficult, time-consuming, costly or simply impossible. Here are a few areas where we frequently see the effective use of small datasets:
1. Medical research
In clinical trials, particularly for rare diseases, researchers often work with a limited number of patients. These small datasets are critical to understanding disease mechanisms and developing new treatments.
2. Ecology and conservation
Studies on rare or endangered species often involve small sample sizes. These limited data are nevertheless essential for biodiversity conservation and management.
3. Market research for small businesses
Small companies and startups often don't have the resources to conduct large-scale market research. They therefore rely on small datasets to gain insights into their customers and the market.
4. Psychology and behavioral sciences
Behavioral studies often involve relatively small sample sizes due to recruitment constraints and the complexity of experimental protocols.
5. Engineering and quality control
In product testing or quality control processes, we often work with limited samples for reasons of cost or time.
6. Astronomy
Despite technological advances, some rare astronomical phenomena can only be observed a limited number of times, resulting in precious small datasets.
7. Pilot studies and exploratory research
In many fields, pilot studies with small samples are used to test feasibility and refine hypotheses before embarking on larger-scale studies.
Comparison of small and large datasets
The comparison between small datasets and large datasets (or "big data") is a frequent topic of discussion in the world of data analysis. Each approach has its strengths and weaknesses, and the choice between the two often depends on the specific context of a study or project. Here's a comparison table highlighting the main differences:
β
| Criterion | Small datasets | Large datasets |
|---|---|---|
| Collection cost and time | Low | High |
| Speed of analysis | Fast, frequent iterations | Slower, heavier infrastructure |
| Understanding of each data point | Deep, qualitative | Aggregated, less granular |
| Representativeness | Limited, risk of sampling bias | Generally better |
| Statistical power | Reduced | High |
| Sensitivity to outliers | High | Low |
| Suitability for deep learning | Limited (risk of overfitting) | Well suited |
It's important to note that these comparisons are general and may vary according to specific situations. In many cases, the ideal approach is to combine the advantages of both types of dataset:
- 1. Use small datasets for rapid exploratory analysis and pilot studies.
- 2. Validate hypotheses and models on larger data sets where possible.
- 3. Use intelligent sampling techniques to extract representative small datasets from large volumes of data (see the sketch below).
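For that third point, here is a hedged sketch of stratified sampling with scikit-learn; the synthetic dataset and the 2% sampling rate are illustrative assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Stand-in for a large dataset with imbalanced classes
X_big, y_big = make_classification(
    n_samples=100_000, n_features=20, weights=[0.9, 0.1], random_state=0
)

# Keep 2% of the data, stratified so class proportions match the original
_, X_small, _, y_small = train_test_split(
    X_big, y_big, test_size=0.02, stratify=y_big, random_state=0
)

print(f"Small dataset: {len(y_small)} samples, "
      f"positive rate {y_small.mean():.3f} (original {y_big.mean():.3f})")
```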
💪 Ultimately, the value of a dataset depends not only on its size, but also on its quality, its relevance to the question posed, and the way it is analyzed and interpreted.
Case studies: some small-dataset success stories reported in the press
To illustrate the power of small datasets, let's look at a few cases where their judicious use led to significant discoveries or innovative applications:
1. Discovery of the exoplanet TRAPPIST-1e
In 2017, a team of astronomers discovered a potentially habitable exoplanet, TRAPPIST-1e, using a relatively small dataset. Their analysis was based on just 70 hours of observations from the Spitzer Space Telescope. Despite the limited size of the data, the researchers were able to accurately identify the characteristics of this planet.
2. Early prediction of Alzheimer's disease
A study led by researchers at the University of California, San Francisco used a small dataset of just 65 patients to develop a machine learning model capable of predicting Alzheimer's disease with 82% accuracy up to six years before clinical diagnosis. This study demonstrates how limited but high-quality data can lead to significant advances in the medical field.
3. Optimizing agricultural production
An agricultural startup used a small dataset of 500 soil samples to develop a predictive model of crop quality. By combining this data with weather information and transfer learning techniques, the startup was able to create an accurate recommendation system for farmers, significantly improving yields in various regions.
4. Improving road safety
One municipality analyzed a dataset of just 200 road accidents to identify key safety issues. Despite the limited sample size, the in-depth analysis of each case enabled specific risk factors to be identified and targeted measures to be implemented, reducing the accident rate by 30% in one year.
5. Development of new materials
Materials science researchers have used a small dataset of 150 compounds to train a model for predicting the properties of new metal alloys. Using data augmentation and transfer learning techniques, they were able to successfully predict the characteristics of new materials, considerably speeding up the development process.
In conclusion: the growing importance of small datasets
At the end of our exploration of small datasets, it is clear that their importance in the data analysis landscape continues to grow. Although the era of big data has revolutionized many fields, not least artificial intelligence, we are seeing renewed interest in small datasets and optimization, rather than the massive use of GPUs, for several reasons:
- 1. Accessibility: small datasets are within reach of a greater number of organizations and individuals, democratizing the adoption and development of AI: AI is accessible to all!
- 2. Speed of iteration: they enable faster cycles of analysis and experimentation, essential in a world where agility is required.
- 3. Focus on quality: the use of small datasets encourages particular attention to the quality and relevance of each data point.
- 4. Ethics and confidentiality: in a context of growing concerns about data confidentiality, small datasets often offer a more ethical and less intrusive alternative.
- 5. Complementarity with big data: far from being in competition, small datasets and big data are often complementary, offering different and enriching perspectives.
- 6. Methodological innovation: the challenges posed by small datasets stimulate innovation in analysis methods, benefiting the entire field of data science.
Are you ready to harness the power of small datasets in your projects? Contact us today to find out how we can develop datasets for you, whatever their size. Together, let's turn your data into actionable insights, training data for your AIs and competitive advantages!