Datasets for text classification: our selection of the most reliable datasets
Large text datasets are essential for training NLP and LLM models, and text classification plays a central role in natural language processing (NLP) applications, enabling AI models to categorize textual information automatically.

In this context, text classification datasets are essential resources for training and evaluating machine learning models. Whether for sentiment classification, subject categorization or spam detection, the quality and diversity of datasets directly influence model performance and reliability.

💡 This article offers a selection of 15 well-known and widely recognized datasets, used and tested in the scientific and industrial communities, providing a solid foundation for training and evaluating text classification systems. And if you don't find what you're looking for, you can contact us. We'd be delighted to tailor a dataset to help you achieve your goals!

Introduction to text classification

Text classification is a fundamental task in natural language processing (NLP) and machine learning. It involves assigning one or more labels or categories to a text according to its content, style or context. This task is essential in many fields, such as information retrieval, sentiment classification, spam detection and content recommendation.

Text classification can be performed with various algorithms and models, such as neural networks, decision trees, random forests and support vector machines (SVMs). Each model has its own strengths and weaknesses, and choosing the right one depends on the type of data, the complexity of the task and the resources available.
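To make the task concrete, here is a minimal sketch of one such model: a toy multinomial Naive Bayes classifier built from scratch in Python. The training sentences and labels are made up purely for illustration; a real project would train on one of the datasets listed below.

```python
import math
from collections import Counter, defaultdict

# Made-up training data: (text, label) pairs.
train = [
    ("great movie loved it", "pos"),
    ("wonderful acting great plot", "pos"),
    ("terrible movie hated it", "neg"),
    ("boring plot terrible acting", "neg"),
]

def fit(samples):
    """Count per-label word frequencies (multinomial Naive Bayes)."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    vocab = set()
    for text, label in samples:
        tokens = text.split()
        word_counts[label].update(tokens)
        label_counts[label] += 1
        vocab.update(tokens)
    return word_counts, label_counts, vocab

def predict(text, word_counts, label_counts, vocab):
    """Pick the label maximizing log P(label) + sum of log P(word | label)."""
    total = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in label_counts:
        score = math.log(label_counts[label] / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for token in text.split():
            # Laplace smoothing: unseen words must not zero out the score.
            score += math.log((word_counts[label][token] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

word_counts, label_counts, vocab = fit(train)
print(predict("loved the great acting", word_counts, label_counts, vocab))  # pos
```

Libraries such as scikit-learn provide production-grade versions of this and the other models mentioned above; the point here is only to show mechanically what "learning from labeled examples" means.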
Why are datasets essential for text classification?

Datasets are essential for text classification because they provide machine learning models with structured examples from which they learn to recognize and differentiate text categories. In natural language processing, a model needs to analyze large quantities of data to understand the linguistic and contextual nuances specific to each category.

In concrete terms, CSV files are often used to structure datasets for machine learning, specifying the required columns and the expected formats for feeding data into various models, notably classification pipelines.
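As a sketch, the snippet below writes and reads back that minimal two-column structure (a text column and a label column) using only Python's standard library; the column names and rows are hypothetical.

```python
import csv
import io

# Hypothetical minimal layout for a classification dataset:
# one "text" column and one "label" column.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["text", "label"])
writer.writeheader()
writer.writerow({"text": "Free prizes, click now!", "label": "spam"})
writer.writerow({"text": "Meeting moved to 3 pm.", "label": "ham"})

# Read it back the way a training pipeline would.
buffer.seek(0)
rows = list(csv.DictReader(buffer))
print(rows[0])  # {'text': 'Free prizes, click now!', 'label': 'spam'}
```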

Without a well-constructed dataset covering a wide range of cases and language variations, the model risks being inaccurate, generalizing poorly or lacking relevance. In addition, datasets make it possible to test and validate a model's performance before it is used in real environments, ensuring that it can reliably handle new data.

They therefore contribute not only to the learning phase but also to the evaluation phase, making it possible to continuously optimize text classification models for specific tasks such as sentiment analysis, spam detection or document categorization.
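That evaluation phase usually boils down to comparing held-out labels with model predictions. A hand-rolled sketch, with labels and predictions made up for illustration:

```python
# Made-up held-out labels and model predictions.
y_true = ["spam", "ham", "spam", "ham", "spam"]
y_pred = ["spam", "ham", "ham", "ham", "spam"]

# Accuracy: fraction of predictions that match the true label.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall for one class: of the truly "spam" messages, how many were caught?
spam_predictions = [p for t, p in zip(y_true, y_pred) if t == "spam"]
spam_recall = spam_predictions.count("spam") / len(spam_predictions)

print(accuracy)     # 0.8
print(spam_recall)  # 2 of 3 spam messages caught
```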
What are the characteristics of a reliable NLP dataset?

A reliable dataset for natural language processing (NLP) is characterized by several key features that guarantee its quality and usefulness for training and evaluating machine learning models.

Sufficient size
A large dataset covering a diversity of cases allows the model to learn varied linguistic nuances. This reduces the risk of overfitting on specific examples and improves the model's ability to generalize.

Linguistic and contextual variety
A good dataset contains samples from different contexts and language styles, whether formal, informal, various dialects or domain-specific jargon. This variety enables the model to adapt better to the differences found in natural language.

Precise, consistent labeling
Data must be labeled consistently and accurately, without errors or ambiguities. Reliable labeling enables the model to learn to classify texts correctly into well-defined categories, be they sentiments, themes or other types of classes.

Data representativeness
A reliable dataset must represent the actual use cases for which the model will be used. For example, for sentiment classification on social networks, it is essential that the dataset contains a sample of texts from similar platforms.

Class balance
In a classification dataset, each class (or category) must be sufficiently represented to avoid bias. A well-balanced dataset ensures that the model is not over-trained on the most frequent categories at the expense of the rarer ones.
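One quick way to check for such an imbalance (with made-up labels), plus a simple mitigation by oversampling the minority class:

```python
from collections import Counter

# Made-up labels with a 90/10 class imbalance.
labels = ["pos"] * 90 + ["neg"] * 10

dist = Counter(labels)
total = sum(dist.values())
proportions = {label: count / total for label, count in dist.items()}
print(proportions)  # {'pos': 0.9, 'neg': 0.1} -- a strong imbalance

# One simple mitigation: oversample the minority class until both
# classes are equally represented (class weighting is another option).
minority = min(dist, key=dist.get)
balanced = labels + [minority] * (max(dist.values()) - dist[minority])
print(Counter(balanced))  # both classes now have 90 examples
```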

Timeliness and relevance
As language evolves rapidly, a reliable dataset needs to be updated regularly to reflect changes in vocabulary, syntax and linguistic trends.

These features ensure that the dataset is suitable for natural language processing, enabling machine learning models to achieve optimal performance while remaining robust in the face of varied and new data.
What are the 15 best datasets for text classification?

Each dataset has its own specificities, suited to particular objectives, whether sentiment analysis, moderation, spam detection or theme categorization.

Here is our selection of 15 datasets commonly used for text classification, covering various use cases and classification types, and widely recognized for their reliability in natural language processing.

1. IMDB Reviews
This dataset includes movie reviews labeled as positive or negative. Its advantage lies in its size and popularity, making it a standard for sentiment classification. Its specificity is that it offers opinion-rich texts, ideal for models that need to understand the nuances of language in users' opinions.

🔗 Link: Kaggle IMDB

2. Amazon Reviews
Containing product reviews with satisfaction levels, this dataset is particularly useful for detecting multiple opinions and customer satisfaction. It is extensive, well-structured and includes metadata (product, rating, etc.), enabling in-depth analysis of purchasing behavior and user feedback.

🔗 Link: Kaggle Amazon Reviews

3. Yelp Reviews
With customer reviews of businesses, labeled from one to five stars, this dataset offers fine granularity for sentiment classification. Its particularity is that it contains information useful in the context of restaurants, hotels and local services, an asset for models targeting these sectors.

🔗 Link: Yelp Reviews

4. AG News
This dataset is commonly used to classify topics in news articles. It is structured into four categories (science, sports, business, technology), providing an excellent basis for NLP models focused on thematic classification or news analysis.

🔗 Link: AG News

5. 20 Newsgroups
A dataset made up of articles from 20 different newsgroups. Its main advantage lies in its thematic diversity, as it covers a wide range of topics, from science to leisure, which is invaluable for testing a model's ability to identify specific themes in heterogeneous corpora.

🔗 Link: 20 Newsgroups

6. DBpedia Ontology
This dataset is derived from Wikipedia and covers over 500 thematic categories, perfect for document classification and knowledge enrichment tasks. Its richness and structure enable models to be trained for complex encyclopedic content categorization tasks.

🔗 Link: DBpedia Ontology

7. SST (Stanford Sentiment Treebank)
A highly detailed dataset for sentiment analysis, with annotations at sentence and word level. Its granularity makes it possible to capture subtle sentiments and to train models that pick up nuances such as progressive positivity or negativity within a review.

🔗 Link: Stanford SST

8. Reuters-21578
Often used in NLP research, this dataset contains articles classified by economic and financial topic. It is highly reliable for classifying financial and economic themes, an asset for companies and business intelligence applications.

🔗 Link: Reuters-21578

9. Twitter Sentiment Analysis Dataset
This dataset groups tweets labeled according to the sentiment they convey, typically positive, negative or neutral. It is ideal for NLP models targeting social networks, as it includes informal language, abbreviations and the short expressions specific to the tweet format.

🔗 Link: Twitter Sentiment Analysis

10. TREC (Text REtrieval Conference) Question Classification
Designed to classify questions into categories (e.g. place, person, number), this dataset is particularly useful for developing question-answering systems. Its advantage lies in its structure, which helps models better understand the intent behind questions.

🔗 Link: TREC

11. News Category Dataset
This journalistic classification dataset brings together press articles from a variety of sources, providing a diversified and up-to-date basis for thematic classification or media content analysis models.

🔗 Link: News Category Dataset

12. SpamAssassin Public Corpus
This email corpus is used for spam detection. Its advantage is that it contains messages from a variety of contexts (phishing, promotions, etc.), making it possible to train effective models for spam detection in email and messaging.

🔗 Link: SpamAssassin

13. Wikipedia Toxic Comments
This dataset is designed to detect toxic, insulting or hateful comments on public platforms. It helps develop models for content moderation applications, an increasingly important area in social media and forums.

🔗 Link: Toxic Comments

14. Emotion Dataset
This dataset is designed to classify emotions (joy, sadness, anger, etc.) in short messages. It is particularly well suited to sentiment analysis in social contexts, or to user-assistance applications requiring a detailed understanding of emotions.

🔗 Link: Emotion Dataset

15. Enron Email Dataset
Comprising emails from the Enron corporation, this dataset is commonly used to analyze corporate exchanges, particularly in the context of fraud detection or internal communications management. Its specificity lies in the variety of its samples (replies, email chains), an asset for analyzing relationships and subjects.

🔗 Link: Enron Email Dataset

Which datasets should be used for subject or category detection?

For topic or category detection, several datasets stand out for their thematic diversity and classification-friendly structure. Here are the most relevant options:

1. AG News
Composed of press articles classified into four main categories (science, sports, business and technology), this dataset is ideal for thematic classification tasks. Its size and simplicity make it an excellent starting point for models that need to learn to identify various topics in news texts.

2. 20 Newsgroups
This dataset brings together articles from 20 newsgroups, covering a wide range of topics such as science, politics, entertainment and technology. Its thematic richness makes it an ideal resource for training models to recognize categories in heterogeneous corpora and capture the particularities of each topic.

3. DBpedia Ontology
Based on Wikipedia, this dataset is organized into several hundred thematic categories. Its level of detail makes it particularly suitable for document classification and encyclopedic content categorization tasks, ideal for projects requiring fine-grained categorization and knowledge enrichment.

4. News Category Dataset
Composed of press articles from a variety of sources, this dataset is organized into journalistic categories. It is perfect for news text classification models, as it makes it possible to quickly identify the main themes in media articles, whether they relate to business, entertainment, politics or other topics.

5. Reuters-21578
This dataset contains press articles classified mainly by economic and financial topics. It is widely used for business intelligence applications and economic research, enabling models to better understand specific themes in business, finance and industry.

💡 These datasets offer valuable resources for topic detection, each tailored to particular types of content (press, forums, encyclopedias) and offering varying levels of detail depending on the needs of the model.

What about datasets for classifying texts in several languages?

Several multilingual datasets have been specifically designed for classifying texts in several languages. These datasets enable machine learning models to learn to recognize and classify texts while taking linguistic diversity into account. Here are some of the most widely used:

1. XNLI (Cross-lingual Natural Language Inference)
This dataset is designed for text comprehension and classification tasks in 15 languages, including French, Spanish, Chinese and Arabic. It is mainly used for entailment classification (meaning relations), but can be adapted to other classification tasks, particularly in multilingual contexts.

2. MLDoc
Based on the Reuters RCV1/RCV2 corpus, this dataset contains news documents in eight languages (English, German, Spanish, French, etc.). It is organized into four main categories and is ideal for multilingual thematic classification, particularly useful for models that have to work in an international news environment.

3. MARC (Multilingual Amazon Reviews Corpus)
This dataset includes Amazon product reviews in several languages (including English, German, French, Japanese and Spanish), tagged for sentiment classification. It is suitable for sentiment and opinion classification projects on international e-commerce platforms.

4. Jigsaw Multilingual Toxic Comment Classification
Developed to identify toxic comments in several languages (English, Spanish, Italian, Portuguese, French, etc.), this dataset is particularly useful for content moderation tasks in multilingual contexts. It is often used to train models for detecting hate speech and other forms of toxicity.

5. CC100
This dataset, built from the Common Crawl project, offers multilingual data from the web. Although not specifically labeled for thematic classification, it is large enough to extract and build multilingual sub-corpora for specific text classification tasks.

6. OPUS (Open Parallel Corpus)
OPUS is a collection of multilingual text resources bringing together data from a variety of sources, such as press sites, forums and international institutions. Although its content is varied, it can be used to create multilingual subsets for thematic or sentiment classification tasks, according to the user's needs.

💡 These multilingual datasets enable researchers and other artificial intelligence enthusiasts to develop models capable of processing textual data in multiple languages, a valuable asset for international applications or platforms requiring global content management.

Conclusion

Text classification plays a central role in natural language processing, and choosing the right dataset is decisive for model performance and accuracy. Datasets provide a structured basis for training models to distinguish sentiments, topics and categories, and even to understand linguistic nuances in multilingual contexts.

Options such as IMDB Reviews and Amazon Reviews stand out for sentiment analysis, while datasets such as AG News and DBpedia Ontology are prime resources for thematic classification. Specific moderation or hate speech detection needs are met by datasets such as Wikipedia Toxic Comments and Jigsaw Multilingual Toxic Comment Classification, which are particularly well suited to multilingual environments.

Thanks to this diversity of resources, AI researchers and practitioners from all backgrounds have access to tools tailored to the particularities of each project, whether for content moderation, opinion analysis or multilingual categorization. Ultimately, these datasets make it possible to train more robust AI models, better adapted to the varied requirements of text classification, guaranteeing a solid foundation and better results for the development of advanced NLP solutions.