muspy.datasets

Dataset classes.

This module provides an easy-to-use dataset management system. Each supported dataset in MusPy comes with a class that inherits from the base MusPy Dataset class. The module also provides interfaces to PyTorch and TensorFlow for creating input pipelines for machine learning.

Base Classes

  • ABCFolderDataset

  • Dataset

  • DatasetInfo

  • FolderDataset

  • RemoteABCFolderDataset

  • RemoteDataset

  • RemoteFolderDataset

  • RemoteMusicDataset

  • MusicDataset

Dataset Classes

  • EssenFolkSongDatabase

  • EMOPIADataset

  • HaydnOp20Dataset

  • HymnalDataset

  • HymnalTuneDataset

  • JSBChoralesDataset

  • LakhMIDIAlignedDataset

  • LakhMIDIDataset

  • LakhMIDIMatchedDataset

  • MAESTRODatasetV1

  • MAESTRODatasetV2

  • MAESTRODatasetV3

  • Music21Dataset

  • MusicNetDataset

  • NESMusicDatabase

  • NottinghamDatabase

  • WikifoniaDataset

class muspy.datasets.ABCFolderDataset(root, convert=False, kind='json', n_jobs=1, ignore_exceptions=True, use_converted=None)[source]

Class for datasets storing ABC files in a folder.

See also

muspy.FolderDataset

Class for datasets storing files in a folder.

read(filename)[source]

Read a file into a Music object.

on_the_fly()[source]

Enable on-the-fly mode and convert the data on the fly.

Returns

Object itself.

class muspy.datasets.Dataset[source]

Base class for MusPy datasets.

To build a custom dataset, inherit from this class and override the methods __getitem__ and __len__ as well as the class attribute _info. __getitem__ should return the i-th data sample as a muspy.Music object. __len__ should return the size of the dataset. _info should be a muspy.DatasetInfo instance storing the dataset information.

classmethod info()[source]

Return the dataset information.

classmethod citation()[source]

Print the citation information.

save(root, kind='json', n_jobs=1, ignore_exceptions=True, verbose=True, **kwargs)[source]

Save all the music objects to a directory.

Parameters
  • root (str or Path) – Root directory to save the data.

  • kind ({'json', 'yaml'}, default: 'json') – File format to save the data.

  • n_jobs (int, default: 1) – Maximum number of concurrently running jobs. If equal to 1, disable multiprocessing.

  • ignore_exceptions (bool, default: True) – Whether to ignore errors and skip failed conversions. This can be helpful if some source files are known to be corrupted.

  • verbose (bool, default: True) – Whether to be verbose.

  • **kwargs – Keyword arguments to pass to muspy.save().

split(filename=None, splits=None, random_state=None)[source]

Split the dataset into train, test and validation subsets.

Parameters
  • filename (str or Path, optional) – Path to the split file. If given and the file exists, the split is read from it. Otherwise, the path (if given) is where the new split is saved.

  • splits (float or list of float, optional) – Ratios for train-test-validation splits. If None, return the full dataset as a whole. If float, return train and test splits. If list of two floats, return train and test splits. If list of three floats, return train, test and validation splits.

  • random_state (int, array_like or RandomState, optional) – Random state used to create the splits. If int or array_like, the value is passed to numpy.random.RandomState, and the created RandomState object is used to create the splits. If RandomState, it will be used to create the splits.
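The meaning of the splits parameter can be illustrated with a small standard-library sketch. This is not MusPy's implementation: the function make_splits, the use of random.Random in place of numpy.random.RandomState, the reading of a single float as the train ratio, and the returned dictionary of index lists are all assumptions made for illustration.

```python
import random


def make_splits(n_samples, splits, seed=None):
    """Partition range(n_samples) into named index lists (hypothetical helper)."""
    # Assumption: a single float is the train ratio; the rest goes to test.
    if isinstance(splits, float):
        splits = [splits, 1.0 - splits]
    if len(splits) not in (2, 3):
        raise ValueError("splits must contain two or three ratios")
    if abs(sum(splits) - 1.0) > 1e-9:
        raise ValueError("splits must sum to 1")

    # Shuffle the indices reproducibly with a seeded generator.
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)

    # Turn cumulative ratios into integer boundaries; pin the last one to
    # n_samples so floating-point drift cannot drop a sample.
    boundaries = []
    cumulative = 0.0
    for ratio in splits:
        cumulative += ratio
        boundaries.append(round(cumulative * n_samples))
    boundaries[-1] = n_samples

    # Split names follow the order used in the parameter description.
    names = ("train", "test", "validation")
    result = {}
    start = 0
    for name, end in zip(names, boundaries):
        result[name] = indices[start:end]
        start = end
    return result


parts = make_splits(10, [0.8, 0.1, 0.1], seed=0)
```

With ten samples and ratios 0.8, 0.1 and 0.1, the three index lists are disjoint and together cover every sample; reusing the same seed reproduces the same partition.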

to_pytorch_dataset(factory=None, representation=None, split_filename=None, splits=None, random_state=None, **kwargs)[source]

Return the dataset as a PyTorch dataset.

Parameters
  • factory (Callable, optional) – Function to be applied to the Music objects. The input is a Music object, and the output is an array or a tensor.

  • representation (str, optional) – Target representation. See muspy.to_representation() for available representations.

  • split_filename (str or Path, optional) – Path to the split file. If given and the file exists, the split is read from it. Otherwise, the path (if given) is where the new split is saved.

  • splits (float or list of float, optional) – Ratios for train-test-validation splits. If None, return the full dataset as a whole. If float, return train and test splits. If list of two floats, return train and test splits. If list of three floats, return train, test and validation splits.

  • random_state (int, array_like or RandomState, optional) – Random state used to create the splits. If int or array_like, the value is passed to numpy.random.RandomState, and the created RandomState object is used to create the splits. If RandomState, it will be used to create the splits.

Returns

Converted PyTorch dataset(s).

Return type

torch.utils.data.Dataset or dict of torch.utils.data.Dataset

to_tensorflow_dataset(factory=None, representation=None, split_filename=None, splits=None, random_state=None, **kwargs)[source]

Return the dataset as a TensorFlow dataset.

Parameters
  • factory (Callable, optional) – Function to be applied to the Music objects. The input is a Music object, and the output is an array or a tensor.

  • representation (str, optional) – Target representation. See muspy.to_representation() for available representations.

  • split_filename (str or Path, optional) – Path to the split file. If given and the file exists, the split is read from it. Otherwise, the path (if given) is where the new split is saved.

  • splits (float or list of float, optional) – Ratios for train-test-validation splits. If None, return the full dataset as a whole. If float, return train and test splits. If list of two floats, return train and test splits. If list of three floats, return train, test and validation splits.

  • random_state (int, array_like or RandomState, optional) – Random state used to create the splits. If int or array_like, the value is passed to numpy.random.RandomState, and the created RandomState object is used to create the splits. If RandomState, it will be used to create the splits.

Returns

Converted TensorFlow dataset(s).

Return type

tensorflow.data.Dataset or dict of tensorflow.data.Dataset

class muspy.datasets.DatasetInfo(name=None, description=None, homepage=None, license=None)[source]

A container for dataset information.

class muspy.datasets.EMOPIADataset(root, download_and_extract=False, overwrite=False, cleanup=False, convert=False, kind='json', n_jobs=1, ignore_exceptions=True, use_converted=None, verbose=True)[source]

EMOPIA Dataset.

get_raw_filenames()[source]

Return a list of raw filenames.

read(filename)[source]

Read a file into a Music object.

class muspy.datasets.EssenFolkSongDatabase(root, download_and_extract=False, overwrite=False, cleanup=False, convert=False, kind='json', n_jobs=1, ignore_exceptions=True, use_converted=None, verbose=True)[source]

Essen Folk Song Database.

class muspy.datasets.FolderDataset(root, convert=False, kind='json', n_jobs=1, ignore_exceptions=True, use_converted=None)[source]

Class for datasets storing files in a folder.

This class extends muspy.Dataset to support folder datasets. To build a custom folder dataset, please refer to the documentation of muspy.Dataset for details. In addition, set the class attribute _extension to the extension to look for when building the dataset, and set read to a callable that takes the filename of a source file as input and returns the converted Music object.

root

Root directory of the dataset.

Type

str or Path

Parameters
  • convert (bool, default: False) – Whether to convert the dataset to MusPy JSON/YAML files. If False, check whether converted data exists. If so, disable on-the-fly mode. If not, enable on-the-fly mode and issue a warning.

  • kind ({'json', 'yaml'}, default: 'json') – File format to save the data.

  • n_jobs (int, default: 1) – Maximum number of concurrently running jobs. If equal to 1, disable multiprocessing.

  • ignore_exceptions (bool, default: True) – Whether to ignore errors and skip failed conversions. This can be helpful if some source files are known to be corrupted.

  • use_converted (bool, optional) – Whether to force disabling on-the-fly mode and use converted data. Defaults to True if converted data exists, otherwise False.

Important

muspy.FolderDataset.converted_exists() depends solely on a special file named .muspy.success in the folder {root}/_converted/, which serves as an indicator for the existence and integrity of the converted dataset. If the converted dataset is built by muspy.FolderDataset.convert(), the .muspy.success file will be created as well. If the converted dataset is created manually, make sure to create the .muspy.success file in the folder {root}/_converted/ to prevent errors.
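For a manually created converted dataset, the marker file can be added with a few lines of standard-library code. The helper mark_converted below is hypothetical, and it assumes an empty file is sufficient, since the text above only requires that the file exists.

```python
import tempfile
from pathlib import Path


def mark_converted(root):
    """Create the .muspy.success marker in {root}/_converted/ (hypothetical helper)."""
    converted_dir = Path(root) / "_converted"
    converted_dir.mkdir(parents=True, exist_ok=True)
    marker = converted_dir / ".muspy.success"
    marker.touch()  # assumption: an empty marker file is enough
    return marker


# Demonstrate on a throwaway directory standing in for a real dataset root.
with tempfile.TemporaryDirectory() as tmp:
    marker = mark_converted(tmp)
    marker_created = marker.is_file()
    marker_relative = marker.relative_to(tmp).as_posix()
```

Only create the marker after checking that the converted files are complete, because it is treated as an indicator of integrity as well as existence.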

Notes

Two modes are available for this dataset. When the on-the-fly mode is enabled, a data sample is converted to a music object on the fly when being indexed. When the on-the-fly mode is disabled, a data sample is loaded from the precomputed converted data.
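The choice between the two modes, as described by the convert and use_converted parameters, can be modeled roughly as follows. The function choose_mode is a hypothetical stand-in written for this illustration and is not part of MusPy; it only mirrors the documented rule that converted data is used by default when the .muspy.success marker exists.

```python
import tempfile
import warnings
from pathlib import Path


def choose_mode(root, use_converted=None):
    """Return "converted" or "on-the-fly" (hypothetical helper, not MusPy code)."""
    marker = Path(root) / "_converted" / ".muspy.success"
    if use_converted is None:
        # Default: use converted data only when the success marker is present.
        use_converted = marker.is_file()
        if not use_converted:
            warnings.warn("Converted data not found; using on-the-fly mode.")
    return "converted" if use_converted else "on-the-fly"


with tempfile.TemporaryDirectory() as tmp:
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")  # silence the demo warning
        before = choose_mode(tmp)

    # Simulate a finished conversion by creating the marker file.
    marker = Path(tmp) / "_converted" / ".muspy.success"
    marker.parent.mkdir()
    marker.touch()
    after = choose_mode(tmp)
    forced = choose_mode(tmp, use_converted=False)
```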

See also

muspy.Dataset

Base class for MusPy datasets.

property converted_dir

Path to the root directory of the converted dataset.

read(filename)[source]

Read a file into a Music object.

load(filename)[source]

Load a file into a Music object.

exists()[source]

Return True if the dataset exists, otherwise False.

converted_exists()[source]

Return True if the converted dataset exists, otherwise False.

get_converted_filenames()[source]

Return a list of converted filenames.

use_converted()[source]

Disable on-the-fly mode and use converted data.

Returns

Object itself.

get_raw_filenames()[source]

Return a list of raw filenames.

on_the_fly()[source]

Enable on-the-fly mode and convert the data on the fly.

Returns

Object itself.

convert(kind='json', n_jobs=1, ignore_exceptions=True, verbose=True, **kwargs)[source]

Convert and save the Music objects.

Each converted file is named by its index and saved to {root}/_converted/. The original filenames can be found in the filenames attribute. For example, the file at filenames[i] will be converted and saved to {i}.json.
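The index-based naming can be sketched as a mapping from raw filenames to converted paths. The helper converted_paths and the sample filenames are hypothetical; only the JSON case stated above is modeled.

```python
from pathlib import Path


def converted_paths(root, filenames):
    """Map each raw filename to its converted JSON path (hypothetical helper)."""
    converted_dir = Path(root) / "_converted"
    # The file at filenames[i] is saved as {i}.json in the converted directory.
    return {
        str(raw): converted_dir / f"{i}.json"
        for i, raw in enumerate(filenames)
    }


mapping = converted_paths("data/my_dataset", ["a/first.mid", "b/second.mid"])
```

Since the converted names carry no trace of the originals, the filenames attribute is the only link back to the source files.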

Parameters
  • kind ({'json', 'yaml'}, default: 'json') – File format to save the data.

  • n_jobs (int, default: 1) – Maximum number of concurrently running jobs. If equal to 1, disable multiprocessing.

  • ignore_exceptions (bool, default: True) – Whether to ignore errors and skip failed conversions. This can be helpful if some source files are known to be corrupted.

  • verbose (bool, default: True) – Whether to be verbose.

  • **kwargs – Keyword arguments to pass to muspy.save().

Returns

Object itself.

class muspy.datasets.HaydnOp20Dataset(root, download_and_extract=False, overwrite=False, cleanup=False, convert=False, kind='json', n_jobs=1, ignore_exceptions=True, use_converted=None, verbose=True)[source]

Haydn Op.20 Dataset.

read(filename)[source]

Read a file into a Music object.

class muspy.datasets.HymnalDataset(root, download=False, convert=False, kind='json', n_jobs=1, ignore_exceptions=True, use_converted=None)[source]

Hymnal Dataset.

read(filename)[source]

Read a file into a Music object.

download()[source]

Download the source datasets.

Returns

Object itself.

class muspy.datasets.HymnalTuneDataset(root, download=False, convert=False, kind='json', n_jobs=1, ignore_exceptions=True, use_converted=None)[source]

Hymnal Dataset (tune only).

read(filename)[source]

Read a file into a Music object.

download()[source]

Download the source datasets.

Returns

Object itself.

class muspy.datasets.JSBChoralesDataset(root, download_and_extract=False, overwrite=False, cleanup=False, convert=False, kind='json', n_jobs=1, ignore_exceptions=True, use_converted=None, verbose=True)[source]

Johann Sebastian Bach Chorales Dataset.

read(filename)[source]

Read a file into a Music object.

class muspy.datasets.LakhMIDIAlignedDataset(root, download_and_extract=False, overwrite=False, cleanup=False, convert=False, kind='json', n_jobs=1, ignore_exceptions=True, use_converted=None, verbose=True)[source]

Lakh MIDI Dataset - aligned subset.

read(filename)[source]

Read a file into a Music object.

class muspy.datasets.LakhMIDIDataset(root, download_and_extract=False, overwrite=False, cleanup=False, convert=False, kind='json', n_jobs=1, ignore_exceptions=True, use_converted=None, verbose=True)[source]

Lakh MIDI Dataset.

read(filename)[source]

Read a file into a Music object.

class muspy.datasets.LakhMIDIMatchedDataset(root, download_and_extract=False, overwrite=False, cleanup=False, convert=False, kind='json', n_jobs=1, ignore_exceptions=True, use_converted=None, verbose=True)[source]

Lakh MIDI Dataset - matched subset.

read(filename)[source]

Read a file into a Music object.

class muspy.datasets.MAESTRODatasetV1(root, download_and_extract=False, overwrite=False, cleanup=False, convert=False, kind='json', n_jobs=1, ignore_exceptions=True, use_converted=None, verbose=True)[source]

MAESTRO Dataset V1 (MIDI only).

read(filename)[source]

Read a file into a Music object.

class muspy.datasets.MAESTRODatasetV2(root, download_and_extract=False, overwrite=False, cleanup=False, convert=False, kind='json', n_jobs=1, ignore_exceptions=True, use_converted=None, verbose=True)[source]

MAESTRO Dataset V2 (MIDI only).

read(filename)[source]

Read a file into a Music object.

class muspy.datasets.MAESTRODatasetV3(root, download_and_extract=False, overwrite=False, cleanup=False, convert=False, kind='json', n_jobs=1, ignore_exceptions=True, use_converted=None, verbose=True)[source]

MAESTRO Dataset V3 (MIDI only).

read(filename)[source]

Read a file into a Music object.

class muspy.datasets.Music21Dataset(composer=None)[source]

Class for datasets containing files in the music21 corpus.

Parameters
  • composer (str) – Name of a composer or a collection. Please refer to the music21 corpus reference page for a full list [1].

  • extensions (list of str) – File extensions of desired files.

References

[1] https://web.mit.edu/music21/doc/about/referenceCorpus.html

convert(root, kind='json', n_jobs=1, ignore_exceptions=True)[source]

Convert and save the Music objects.

Parameters
  • root (str or Path) – Root directory to save the data.

  • kind ({'json', 'yaml'}, default: 'json') – File format to save the data.

  • n_jobs (int, default: 1) – Maximum number of concurrently running jobs. If equal to 1, disable multiprocessing.

  • ignore_exceptions (bool, default: True) – Whether to ignore errors and skip failed conversions. This can be helpful if some source files are known to be corrupted.

class muspy.datasets.MusicDataset(root, kind=None)[source]

Class for datasets of MusPy JSON/YAML files.

Parameters
  • root (str or Path) – Root directory of the dataset.

  • kind ({'json', 'yaml'}, optional) – File formats to include in the dataset. Defaults to include both JSON and YAML files.

root

Root directory of the dataset.

Type

Path

filenames

Path to the files, relative to root.

Type

list of Path

See also

muspy.Dataset

Base class for MusPy datasets.

class muspy.datasets.MusicNetDataset(root, download_and_extract=False, overwrite=False, cleanup=False, convert=False, kind='json', n_jobs=1, ignore_exceptions=True, use_converted=None, verbose=True)[source]

MusicNet Dataset (MIDI only).

read(filename)[source]

Read a file into a Music object.

class muspy.datasets.NESMusicDatabase(root, download_and_extract=False, overwrite=False, cleanup=False, convert=False, kind='json', n_jobs=1, ignore_exceptions=True, use_converted=None, verbose=True)[source]

NES Music Database.

read(filename)[source]

Read a file into a Music object.

class muspy.datasets.NottinghamDatabase(root, download_and_extract=False, overwrite=False, cleanup=False, convert=False, kind='json', n_jobs=1, ignore_exceptions=True, use_converted=None, verbose=True)[source]

Nottingham Database.

class muspy.datasets.RemoteABCFolderDataset(root, download_and_extract=False, overwrite=False, cleanup=False, convert=False, kind='json', n_jobs=1, ignore_exceptions=True, use_converted=None, verbose=True)[source]

Base class for remote datasets storing ABC files in a folder.

See also

muspy.ABCFolderDataset

Class for datasets storing ABC files in a folder.

muspy.RemoteDataset

Base class for remote MusPy datasets.

class muspy.datasets.RemoteDataset(root, download_and_extract=False, overwrite=False, cleanup=False, verbose=True)[source]

Base class for remote MusPy datasets.

This class extends muspy.Dataset to support remote datasets. To build a custom remote dataset, please refer to the documentation of muspy.Dataset for details. In addition, set the class attribute _sources to the URLs to the source files (see Notes).

root

Root directory of the dataset.

Type

str or Path

Parameters
  • download_and_extract (bool, default: False) – Whether to download and extract the dataset.

  • overwrite (bool, default: False) – Whether to overwrite existing file(s).

  • cleanup (bool, default: False) – Whether to remove the source archive(s).

  • verbose (bool, default: True) – Whether to be verbose.

Raises

RuntimeError – If download_and_extract is False but file {root}/.muspy.success does not exist (see below).

Important

muspy.RemoteDataset.exists() depends solely on a special file named .muspy.success in directory {root}/. This file serves as an indicator for the existence and integrity of the dataset. It will automatically be created if the dataset is successfully downloaded and extracted by muspy.RemoteDataset.download_and_extract(). If the dataset is downloaded manually, make sure to create the .muspy.success file in directory {root}/ to prevent errors.

Notes

The class attribute _sources is a dictionary storing the following information of each source file.

  • filename (str): Name to save the file.

  • url (str): URL to the file.

  • archive (bool): Whether the file is an archive.

  • md5 (str, optional): Expected MD5 checksum of the file.

  • sha256 (str, optional): Expected SHA256 checksum of the file.

Here is an example:

_sources = {
    "example": {
        "filename": "example.tar.gz",
        "url": "https://www.example.com/example.tar.gz",
        "archive": True,
        "md5": None,
        "sha256": None,
    }
}
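To show how the optional md5 and sha256 fields might be used, here is a standard-library sketch that checks a downloaded file against one _sources entry. The helper checksum_matches is hypothetical and is not MusPy's own verification code; it assumes that a checksum set to None is simply skipped.

```python
import hashlib
import tempfile
from pathlib import Path


def checksum_matches(path, source):
    """Check a file against the optional checksums of one _sources entry.

    Hypothetical helper; a checksum that is None or absent is skipped.
    """
    data = Path(path).read_bytes()
    for algorithm in ("md5", "sha256"):
        expected = source.get(algorithm)
        if expected is None:
            continue
        if hashlib.new(algorithm, data).hexdigest() != expected:
            return False
    return True


with tempfile.TemporaryDirectory() as tmp:
    # A stand-in file; a real source would be the downloaded archive.
    archive = Path(tmp) / "example.tar.gz"
    archive.write_bytes(b"not a real archive")
    source = {
        "filename": "example.tar.gz",
        "url": "https://www.example.com/example.tar.gz",
        "archive": True,
        "md5": None,
        "sha256": hashlib.sha256(b"not a real archive").hexdigest(),
    }
    ok = checksum_matches(archive, source)
    tampered = checksum_matches(archive, {**source, "sha256": "0" * 64})
```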

See also

muspy.Dataset

Base class for MusPy datasets.

exists()[source]

Return True if the dataset exists, otherwise False.

source_exists()[source]

Return True if all the sources exist, otherwise False.

download(overwrite=False, verbose=True)[source]

Download the dataset source(s).

Parameters
  • overwrite (bool, default: False) – Whether to overwrite existing file(s).

  • verbose (bool, default: True) – Whether to be verbose.

Returns

Object itself.

extract(cleanup=False, verbose=True)[source]

Extract the downloaded archive(s).

Parameters
  • cleanup (bool, default: False) – Whether to remove the source archive after extraction.

  • verbose (bool, default: True) – Whether to be verbose.

Returns

Object itself.

download_and_extract(overwrite=False, cleanup=False, verbose=True)[source]

Download source datasets and extract the downloaded archives.

Parameters
  • overwrite (bool, default: False) – Whether to overwrite existing file(s).

  • cleanup (bool, default: False) – Whether to remove the source archive(s).

  • verbose (bool, default: True) – Whether to be verbose.

Returns

Object itself.

class muspy.datasets.RemoteFolderDataset(root, download_and_extract=False, overwrite=False, cleanup=False, convert=False, kind='json', n_jobs=1, ignore_exceptions=True, use_converted=None, verbose=True)[source]

Base class for remote datasets storing files in a folder.

root

Root directory of the dataset.

Type

str or Path

Parameters
  • download_and_extract (bool, default: False) – Whether to download and extract the dataset.

  • cleanup (bool, default: False) – Whether to remove the source archive(s).

  • convert (bool, default: False) – Whether to convert the dataset to MusPy JSON/YAML files. If False, check whether converted data exists. If so, disable on-the-fly mode. If not, enable on-the-fly mode and issue a warning.

  • kind ({'json', 'yaml'}, default: 'json') – File format to save the data.

  • n_jobs (int, default: 1) – Maximum number of concurrently running jobs. If equal to 1, disable multiprocessing.

  • ignore_exceptions (bool, default: True) – Whether to ignore errors and skip failed conversions. This can be helpful if some source files are known to be corrupted.

  • use_converted (bool, optional) – Whether to force disabling on-the-fly mode and use converted data. Defaults to True if converted data exists, otherwise False.

See also

muspy.FolderDataset

Class for datasets storing files in a folder.

muspy.RemoteDataset

Base class for remote MusPy datasets.

read(filename)[source]

Read a file into a Music object.

class muspy.datasets.RemoteMusicDataset(root, download_and_extract=False, overwrite=False, cleanup=False, kind=None, verbose=True)[source]

Base class for remote datasets of MusPy JSON/YAML files.

Parameters
  • root (str or Path) – Root directory of the dataset.

  • download_and_extract (bool, default: False) – Whether to download and extract the dataset.

  • overwrite (bool, default: False) – Whether to overwrite existing file(s).

  • cleanup (bool, default: False) – Whether to remove the source archive(s).

  • kind ({'json', 'yaml'}, optional) – File formats to include in the dataset. Defaults to include both JSON and YAML files.

  • verbose (bool, default: True) – Whether to be verbose.

root

Root directory of the dataset.

Type

Path

filenames

Path to the files, relative to root.

Type

list of Path

See also

muspy.MusicDataset

Class for datasets of MusPy JSON/YAML files.

muspy.RemoteDataset

Base class for remote MusPy datasets.

class muspy.datasets.WikifoniaDataset(root, download_and_extract=False, overwrite=False, cleanup=False, convert=False, kind='json', n_jobs=1, ignore_exceptions=True, use_converted=None, verbose=True)[source]

Wikifonia Dataset.

read(filename)[source]

Read a file into a Music object.

muspy.datasets.get_dataset(key)[source]

Return a certain dataset class by key.

Parameters

key (str) – Dataset key (case-insensitive).

Returns

The corresponding dataset class.

muspy.datasets.list_datasets()[source]

Return all supported dataset classes as a list.

Returns

A list of all supported dataset classes.