How to download a dataset on Hugging Face

Downloading datasets from Hugging Face is straightforward using the Datasets library. Below is a step-by-step guide to help you download and use datasets from the Hugging Face Hub.

Step 1: Install the Datasets Library

If you haven’t already installed the datasets library, install it with pip by running pip install datasets in a terminal. On Google Colab, run !pip install datasets in a notebook cell.

Step 2: Load a Dataset

You can load a dataset using the load_dataset function. This function can download datasets from the Hugging Face Hub or load them from local files.

Download a Dataset from the Hugging Face Hub

To download a dataset from the Hugging Face Hub, simply specify the dataset name. For example, to load the IMDB dataset:

Calling load_dataset("imdb") downloads IMDB, a popular sentiment analysis dataset that is often used for training and evaluating machine learning models. Here’s what happens:

  1. Importing load_dataset: The load_dataset function is part of the datasets library, which provides access to various public datasets like IMDB, SQuAD, and more.
  2. Loading the IMDB dataset: The IMDB dataset contains movie reviews labeled as “positive” or “negative” for sentiment classification. When you run load_dataset("imdb"), it automatically downloads and prepares the dataset, then caches it locally so later calls reuse the same files.
  3. Printing the dataset: Executing print(dataset) will display an overview of the dataset, including:
    • The dataset splits (e.g., train, test).
    • The number of samples in each split.
    • The features (columns) of each split (e.g., text for reviews and label for sentiment).

Output

The output shows the structure of the IMDB dataset as a DatasetDict, which organizes the dataset into different splits. Here’s a quick breakdown of what you’re seeing:

  • train: Contains 25,000 rows with the features text (the movie review) and label (the sentiment: 0 for negative, 1 for positive).
  • test: Also contains 25,000 rows, with the same features, reserved for evaluation.
  • unsupervised: Contains 50,000 rows of unlabeled reviews. The label column is still present, but it is set to -1 for every row. This split is often used for tasks like pretraining or semi-supervised learning.

Load a Specific Split

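You can load a single split by passing the split argument (for IMDB, the available splits are train, test, and unsupervised). The result is then a Dataset rather than a DatasetDict: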

Access Dataset Samples

You can access individual samples or slices of the dataset:

Step 3: Download a Dataset with Custom Configurations

Some datasets have multiple configurations or subsets. You can specify the configuration using the name parameter.

For example, the Wikipedia dataset has configurations for different languages:

Step 4: Load a Dataset from a Local File

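If you have a dataset stored locally (e.g., in CSV, JSON, or text format), you can load it with the load_dataset function by passing the format name (csv, json, or text) as the first argument and the file path through the data_files parameter.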

Load a CSV File

Load Multiple Files

You can load multiple files by passing a list of file paths:

Step 5: Stream Large Datasets

For very large datasets, you can pass streaming=True to load_dataset. Instead of downloading and storing the whole dataset first, streaming mode fetches examples lazily as you iterate over them:

Step 6: Download a Dataset from the Hugging Face Hub Website

If you prefer to download datasets manually, you can do so from the Hugging Face Hub website:

  1. Go to the Hugging Face Hub: https://huggingface.co/datasets.
  2. Search for the dataset you want (e.g., imdb).
  3. Click on the dataset to open its page.
  4. Download the dataset files directly from the “Files and versions” tab.

Step 7: Use the Downloaded Dataset

Once the dataset is downloaded, you can use it for training, evaluation, or analysis. Here’s an example of using the IMDB dataset for sentiment analysis:

Step 8: Save a Dataset Locally

If you want to save a dataset locally for offline use, you can do so using the save_to_disk() method, and reload it later with load_from_disk():

Studyopedia Editorial Staff