
Hugging Face datasets


The Hugging Face Hub has a lot of datasets that may be useful for various ML purposes. In this topic, you will learn how to choose a dataset, use it for your project, and load your own dataset into Hugging Face.

Filtering datasets

When you go to the Hugging Face Datasets tab, you can choose the datasets that suit your needs by setting these filters:

  • Tasks and subtasks. The HF Hub has datasets not only for natural language processing but also for computer vision and audio processing, as well as datasets for multimodal models. Within these large groups, the datasets are further divided into smaller ones. For example, within the NLP group, you can choose a dataset for text or token classification, question answering, translation, text generation, and other tasks.
  • Languages. The language of the data is an important parameter for NLP datasets. A dataset can be monolingual or multilingual.
  • Size. The size of a dataset is specified in bytes.
  • Licenses, which determine how you are allowed to use the dataset.
  • Other parameters that describe the specifics of the data (for example, the "music" or "medicine" tags) or training features (the "Trained with Autotrain" parameter).

A screenshot of the Hugging Face website where the available datasets are listed
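The same kind of search can also be done programmatically with the huggingface_hub library. Here is a minimal sketch; the exact filter tag strings ("task_categories:summarization", "language:en") are assumptions and may need adjusting to what the Hub currently exposes:

from huggingface_hub import list_datasets

# list up to 10 datasets tagged as English summarization data;
# the tag strings are examples and may differ on the current Hub
results = list_datasets(filter=["task_categories:summarization", "language:en"],
                        limit=10)
for ds in results:
    print(ds.id)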

By setting the filters correctly, you can generate a list of datasets that suit your task. But how can you better understand what data a dataset contains and whether it is the right one for you? Let's talk about it in the next section.

Dataset card

When you click on a dataset, you go to its card page. A dataset card is a documentation page that provides detailed information about the dataset. These cards are a valuable resource for understanding the dataset's structure, content, and potential uses. They offer an overview of the dataset, including its name, description, size, licensing information, and any relevant citations. You can also look at sample data right on the card, so you don't have to download the dataset just to preview it. For example, this is what the preview of the glue dataset looks like.

A dataset card on the Hugging Face Hub

The cards also provide information about the dataset's features, such as the number of examples, the available splits (train, validation, and test), and the format in which the data is stored. They often include sample Python code snippets demonstrating how to load and use the dataset.

Furthermore, the cards may contain additional metadata about the dataset, such as the source it was derived from, any associated tasks, and the evaluation metrics commonly used for that particular dataset.

The Hugging Face Datasets library makes it easy to access and work with these datasets by providing a unified API for loading, preprocessing, and exploring the data. It also offers functionalities like caching, shuffling, and filtering to streamline the dataset-handling process.
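For instance, once a dataset is loaded, shuffling and filtering are one-liners. A small sketch using the glue "ax" configuration previewed above (its examples contain a premise field):

from datasets import load_dataset

dataset = load_dataset("glue", "ax", split="test")

shuffled = dataset.shuffle(seed=42)                          # reproducible shuffling
short = shuffled.filter(lambda x: len(x["premise"]) < 100)   # keep only short premises
print(short.num_rows)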

By leveraging the information provided in the dataset cards, developers and researchers can quickly get started with using a specific dataset, understand its characteristics, and integrate it into their NLP workflows.

Suppose you have selected the dataset you need. You can download it manually as a set of files or load it with the datasets library. To do the latter, open the dataset card, click the "Use in dataset library" button, and copy the code from the tab that appears. For example, to download the glue dataset, run the following code. Note that the second argument is the name of the configuration (subset) to use.

from datasets import load_dataset

dataset = load_dataset("glue", "ax")

After running this code, the dataset is downloaded, cached locally, and assigned to the dataset variable. Simple, right?
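What load_dataset returns is a DatasetDict keyed by split name, so you can quickly inspect it (for the glue "ax" configuration used above, the data comes as a single test split):

print(dataset)             # shows the available splits and their sizes
print(dataset["test"][0])  # the first example of the test split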

But imagine the situation where you have your own unique dataset that you want to load onto the Hugging Face Hub. Let's see how you can do it!

Load a dataset

To load your own dataset (or a dataset hosted on other resources), use the following code:

from datasets import load_dataset

dataset = load_dataset('csv',
                       data_files={'train': ['your_train_data.csv'],
                                   'validation': ['your_valid_data.csv'],
                                   'test': ['your_test_data.csv']},
                       encoding='windows-1251',  # default is utf-8
                       delimiter=';',            # default is ,
                       index_col=False           # don't treat the first column as an index
                      )

Note that this code is suitable for CSV files only. Your dataset should be divided into three files: your_train_data.csv, your_valid_data.csv, and your_test_data.csv. Specify encoding, delimiter, and index_col only if your data requires it. In the example above, we used them because our imaginary dataset had such settings: it was encoded in windows-1251 (while the default is utf-8), the values were separated by ; (while the default delimiter is ,), and we wanted to avoid duplicating an index column that already existed in the files.
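If your data is not pre-split, a common pattern is to load a single file and split it afterwards. A sketch, assuming a hypothetical your_data.csv:

from datasets import load_dataset

# everything from a single file lands in the 'train' split by default
dataset = load_dataset('csv', data_files='your_data.csv')['train']

# carve out 20% of the rows as a test set
splits = dataset.train_test_split(test_size=0.2, seed=42)
print(splits)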

Before fine-tuning the model on the uploaded dataset, it's better to check what your dataset looks like:

print(dataset['train'])
print(dataset['validation'])
print(dataset['test'])


##  Dataset({
##      features: ['Long text', 'Short text'],
##      num_rows: 856
##  })
##  Dataset({
##      features: ['Long text', 'Short text'],
##      num_rows: 150
##  })
##  Dataset({
##      features: ['Long text', 'Short text'],
##      num_rows: 240
##  })

This is what a dataset for the text summarization task may look like. You can save this dataset to the Hub so that you don't need to load it from the source files each time you work on a project.

path = 'your_profile_name/your_repo_name'  # create a repo in HF and paste a path name here

dataset.push_to_hub(path)
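Note that push_to_hub requires you to be authenticated with a Hugging Face token that has write access. One way to do that is the huggingface_hub helper below (running huggingface-cli login in a terminal works as well):

from huggingface_hub import login

login()  # paste a token with write access when prompted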

Now you (and everyone else, if the repo is public) can easily access this dataset with this code:

dataset = load_dataset(path)
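And if you only need a particular split, load_dataset also accepts a split argument:

train_only = load_dataset(path, split='train')  # returns a single Dataset instead of a DatasetDict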

Conclusion

In this topic, you have learned more about Hugging Face datasets: how to select them using filters, download them from the HF Hub, and push your own datasets to the Hub. Now, let's move on to practice!
