Multitrack Music Transformer

ICASSP 2023

Hao-Wen Dong, Ke Chen, Shlomo Dubnov, Julian McAuley, Taylor Berg-Kirkpatrick
University of California San Diego

[paper] [demo] [video] [slides] [code] [reviews]



Summary of the compared models

Model      | Instrument control | Compound tokens | Average sample length (seconds) | Inference speed (notes per second)
MMT (ours) | yes                | yes             | 100.42                          | 11.79
MMM [1]    | yes                | no              | 38.69                           | 5.66
REMI+ [2]  | yes                | no              | 28.69                           | 3.58

Note: All samples are generated in a single pass through the model using a sequence length of 1024. As a result, the generated music is usually shorter for a complex ensemble than for a simple one, since denser music consumes more events per second.
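
To make this concrete, here is a minimal sketch, assuming the paper's six-field compound-token representation (type, beat, position, pitch, duration, instrument); the Event class, the field codes, and the note densities below are illustrative assumptions, not the released implementation.

from typing import NamedTuple

# One model step predicts a compound token: six fields at once,
# following the paper's (type, beat, position, pitch, duration,
# instrument) layout. The field codes here are illustrative.
class Event(NamedTuple):
    type: int        # start-of-song / instrument / start-of-notes / note / end-of-song
    beat: int        # onset beat
    position: int    # subdivision within the beat
    pitch: int       # MIDI pitch number
    duration: int    # note length in time steps
    instrument: int  # instrument code

print(Event(type=4, beat=0, position=0, pitch=60, duration=4, instrument=0))

MAX_EVENTS = 1024  # single-pass generation budget

def expected_length_seconds(notes_per_second: float) -> float:
    # A fixed event budget means a denser ensemble spends its 1024
    # events in fewer seconds of music, hence shorter samples.
    return MAX_EVENTS / notes_per_second

print(expected_length_seconds(10.0))  # sparse duet: ~102 s
print(expected_length_seconds(35.0))  # dense seven-part ensemble: ~29 s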


Best samples

Best unconditioned generation samples

Settings: Only a ‘start-of-song’ event is provided to the model. The model generates the instrument list and subsequently the note sequence.
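
A hypothetical sketch of the prompt for this setting, reusing the compound-token layout above; the START_OF_SONG code and the zero-padding of unused fields are assumptions, not the released vocabulary.

# Unconditioned generation: the model sees only a start-of-song event
# and samples the instruments and notes itself.
START_OF_SONG = 1  # illustrative event-type code

def unconditioned_prompt() -> list[tuple[int, ...]]:
    # Unused fields of the compound token are zero-padded.
    return [(START_OF_SONG, 0, 0, 0, 0, 0)]

print(unconditioned_prompt())  # [(1, 0, 0, 0, 0, 0)]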

Best instrument-informed generation samples

Settings: The model is given a ‘start-of-song’ event followed by a sequence of instrument codes and a ‘start-of-notes’ event. The model then generates the note sequence (a prompt-construction sketch follows the ensemble list below).

Ensemble: piano, church-organ, voices
Ensemble: contrabass, harp, english-horn, flute
Ensemble: trumpet, trombone
Ensemble: church-organ, viola, contrabass, strings, voices, horn, oboe
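
The ensembles above map to prompts like the following; this is a minimal sketch under the same assumed layout, where the event-type codes and the instrument-code table are illustrative (here borrowing General MIDI program numbers), not the released vocabulary.

# Instrument-informed generation: start-of-song, one instrument event
# per requested instrument, then start-of-notes; the model continues
# with note events from there.
START_OF_SONG, INSTRUMENT, START_OF_NOTES = 1, 2, 3  # illustrative codes
INSTRUMENT_CODES = {"trumpet": 56, "trombone": 57}   # e.g. GM programs

def instrument_informed_prompt(names: list[str]) -> list[tuple[int, ...]]:
    events = [(START_OF_SONG, 0, 0, 0, 0, 0)]
    for name in names:
        events.append((INSTRUMENT, 0, 0, 0, 0, INSTRUMENT_CODES[name]))
    events.append((START_OF_NOTES, 0, 0, 0, 0, 0))
    return events

print(instrument_informed_prompt(["trumpet", "trombone"]))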

Best 4-beat continuation samples

Settings: All instrument and note events in the first 4 beats are provided to the model. The model then generates subsequent note events that continue the input music.
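
A minimal sketch of how such a prompt could be sliced from a ground-truth event sequence, assuming the layout above and that structural events (start-of-song, instrument, start-of-notes) carry a beat field of 0; the 16-beat setting further down uses the same construction with num_beats=16.

BEAT = 1  # index of the beat field in the compound token

def continuation_prompt(events, num_beats=4):
    # Keep every event whose onset falls in the first num_beats beats;
    # structural events (assumed beat 0) are kept automatically. The
    # model then generates note events that continue the music.
    return [event for event in events if event[BEAT] < num_beats]

song = [(1, 0, 0, 0, 0, 0),    # start-of-song
        (2, 0, 0, 0, 0, 56),   # instrument: trumpet (illustrative)
        (3, 0, 0, 0, 0, 0),    # start-of-notes
        (4, 0, 0, 60, 4, 56),  # note on beat 0, kept in the prompt
        (4, 5, 0, 64, 4, 56)]  # note on beat 5, dropped from the prompt
print(continuation_prompt(song, num_beats=4))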


Examples of unconditioned generation (unselected)

Settings: Only a ‘start-of-song’ event is provided to the model. The model generates the instrument list and subsequently the note sequence.

[Audio samples 1–3 each for MMT (ours), MMM, and REMI+.]

Examples of instrument-informed generation (unselected)

Settings: The model is given a ‘start-of-song’ event followed by a sequence of instrument codes and a ‘start-of-notes’ event. The model then generates the note sequence.

[Audio samples 1–3 for MMT (ours).]

Examples of 4-beat continuation (unselected)

Settings: All instrument and note events in the first 4 beats are provided to the model. The model then generates subsequent note events that continue the input music.

[Audio samples 1–3 for MMT (ours), with the corresponding ground truth.]

Examples of 16-beat continuation (unselected)

Settings: All instrument and note events in the first 16 beats are provided to the model. The model then generates subsequent note events that continue the input music.

[Audio samples 1–3 for MMT (ours), with the corresponding ground truth.]

Citation

Hao-Wen Dong, Ke Chen, Shlomo Dubnov, Julian McAuley, and Taylor Berg-Kirkpatrick, “Multitrack Music Transformer,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023.

@inproceedings{dong2023mmt,
    author = {Hao-Wen Dong and Ke Chen and Shlomo Dubnov and Julian McAuley and Taylor Berg-Kirkpatrick},
    title = {Multitrack Music Transformer},
    booktitle = {Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
    year = 2023,
}

[1] Jeff Ens and Philippe Pasquier, “MMM: Exploring conditional multi-track music generation with the transformer,” arXiv preprint arXiv:2008.06048, 2020.

[2] Dimitri von Rütte, Luca Biggio, Yannic Kilcher, and Thomas Hofmann, “FIGARO: Generating symbolic music with fine-grained artistic control,” arXiv preprint arXiv:2201.10936, 2022.