ICASSP 2023
Multitrack Music Transformer
Hao-Wen Dong
Ke Chen
Shlomo Dubnov
Julian McAuley
Taylor Berg-Kirkpatrick
University of California San Diego
Model | Instrument control | Compound tokens | Average sample length (seconds) | Inference speed (notes per second)
---|---|---|---|---
MMT (ours) | ✓ | ✓ | 100.42 | 11.79
MMM | ✕ | ✕ | 38.69 | 5.66
REMI+ | ✕ | ✕ | 28.69 | 3.58
Note: All samples are generated in a single pass through the model using a sequence length of 1024. As a result, the generated music is usually shorter for a complex ensemble than for a simple one.
Settings: Only a 'start-of-song' event is provided to the model. The model generates the instrument list and subsequently the note sequence.
Settings: The model is given a 'start-of-song' event followed by a sequence of instrument codes and a 'start-of-notes' event. The model then generates the note sequence.
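The two prompting modes above can be sketched as follows. This is a minimal illustration, not the released code: the event names ("start-of-song", "instrument", "start-of-notes") follow the paper's vocabulary, but the tuple-based event format and function names here are hypothetical.

```python
def make_unconditioned_prompt():
    # Unconditioned generation: only a 'start-of-song' event; the model
    # is left to generate the instrument list and then the notes.
    return [("start-of-song",)]

def make_instrument_informed_prompt(instruments):
    # Instrument-informed generation: 'start-of-song', then one event per
    # instrument code, then 'start-of-notes'; the model generates only the
    # note sequence for the given ensemble.
    prompt = [("start-of-song",)]
    prompt += [("instrument", inst) for inst in instruments]
    prompt.append(("start-of-notes",))
    return prompt

prompt = make_instrument_informed_prompt(["trumpet", "trombone"])
```

In the instrument-informed setting, the ensemble is fixed by the prompt, so the model cannot introduce instruments outside the given list.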
Ensemble: piano, church-organ, voices
Ensemble: contrabass, harp, english-horn, flute
Ensemble: trumpet, trombone
Ensemble: church-organ, viola, contrabass, strings, voices, horn, oboe
Settings: All instrument and note events in the first 4 beats are provided to the model. The model then generates subsequent note events that continue the input music.
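The continuation setting above can be sketched as a simple slice of the event stream by beat index. This is a hypothetical illustration, not the released code: the (beat, position, pitch, duration, program) tuple layout and the example data are assumptions for demonstration only.

```python
def make_continuation_prompt(note_events, num_beats=4):
    # Keep only note events whose beat index falls inside the prompt
    # window; the model would generate the events that continue them.
    return [ev for ev in note_events if ev[0] < num_beats]

# Hypothetical note events: (beat, position, pitch, duration, program).
notes = [(0, 0, 60, 4, 0), (2, 0, 64, 4, 0), (6, 0, 67, 4, 0)]
prompt = make_continuation_prompt(notes)
```

The same slicing with `num_beats=16` gives the 16-beat continuation setting used later on this page.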
Settings: Only a 'start-of-song' event is provided to the model. The model generates the instrument list and subsequently the note sequence.
Model | Sample 1 | Sample 2 | Sample 3
---|---|---|---
MMT (ours) | | |
MMM | | |
REMI+ | | |
Settings: The model is given a 'start-of-song' event followed by a sequence of instrument codes and a 'start-of-notes' event. The model then generates the note sequence.
Model | Sample 1 | Sample 2 | Sample 3
---|---|---|---
MMT (ours) | | |
Settings: All instrument and note events in the first 4 beats are provided to the model. The model then generates subsequent note events that continue the input music.
Model | Sample 1 | Sample 2 | Sample 3
---|---|---|---
MMT (ours) | | |
Ground truth | | |
Settings: All instrument and note events in the first 16 beats are provided to the model. The model then generates subsequent note events that continue the input music.
Model | Sample 1 | Sample 2 | Sample 3
---|---|---|---
MMT (ours) | | |
Ground truth | | |
Hao-Wen Dong, Ke Chen, Shlomo Dubnov, Julian McAuley and Taylor Berg-Kirkpatrick, “Multitrack Music Transformer,” Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023.
@inproceedings{dong2023mmt,
author = {Hao-Wen Dong and Ke Chen and Shlomo Dubnov and Julian McAuley and Taylor Berg-Kirkpatrick},
title = {Multitrack Music Transformer},
booktitle = {Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
year = 2023,
}
Jeff Ens and Philippe Pasquier, "MMM: Exploring conditional multi-track music generation with the transformer," arXiv preprint arXiv:2008.06048, 2020.
Dimitri von Rütte, Luca Biggio, Yannic Kilcher, and Thomas Hofmann, "FIGARO: Generating symbolic music with fine-grained artistic control," arXiv preprint arXiv:2201.10936, 2022.