CLIPSynth

CLIPSynth: Learning Text-to-audio Synthesis from Videos using CLIP and Diffusion Models

Hao-Wen Dong1,2*   Gunnar A. Sigurdsson1   Chenyang Tao1   Jiun-Yu Kao1   Yu-Hsiang Lin1   Anjali Narayan-Chen1
Arpit Gupta1   Tagyoung Chung1   Jing Huang1   Nanyun Peng1,3   Wenbo Zhao1
1 Amazon Alexa AI   2 University of California San Diego   3 University of California, Los Angeles
* Work done during an internship at Amazon

paper demo video slides



Summary of the compared models

Model            | Generative | Unlabeled data only | Query type (training) | Query type (test)
-----------------|------------|---------------------|-----------------------|------------------
CLIPSynth        | ✓          | ✓                   | Image                 | Text
CLIPSynth-Text   | ✓          |                     | Text                  | Text
CLIPSynth-Hybrid | ✓          |                     | Image + Text          | Text
MiniLMSynth      | ✓          |                     | Text                  | Text
CLIPRetriever    |            | ✓                   | -                     | Text
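The key idea behind CLIPSynth is that CLIP embeds images and text into a shared space, so a synthesizer trained to condition on image embeddings (available from unlabeled videos) can be queried with text embeddings at test time. A minimal sketch of this substitution, where `encode_image`, `encode_text`, and `synthesize` are hypothetical stand-ins (random toy encoders and a placeholder decoder, not the paper's code or the real CLIP API):

```python
import numpy as np

DIM = 512  # embedding dimension of CLIP ViT-B/32

rng = np.random.default_rng(0)

def encode_image(frame):
    """Toy stand-in for CLIP's image encoder: a random unit vector."""
    v = rng.standard_normal(DIM)
    return v / np.linalg.norm(v)

def encode_text(prompt):
    """Toy stand-in for CLIP's text encoder: a random unit vector."""
    v = rng.standard_normal(DIM)
    return v / np.linalg.norm(v)

def synthesize(query_embedding):
    """Placeholder for a conditional audio decoder: any model that
    accepts a unit-norm conditioning vector in the shared space."""
    assert query_embedding.shape == (DIM,)
    return np.zeros(16000)  # placeholder 1-second waveform at 16 kHz

# Training (CLIPSynth): condition on image embeddings of video frames,
# so no text labels are needed.
audio = synthesize(encode_image("video_frame"))

# Test time: swap in a text embedding -- same space, same decoder.
audio = synthesize(encode_text("playing acoustic guitar"))
```

Because the decoder only ever sees a vector in the shared embedding space, the image-to-text swap requires no retraining; this is what the "Image" (training) vs. "Text" (test) columns in the table reflect.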


Example results on MUSIC

All examples in this section use text queries and were selected at random.

Examples

bassoon · cello · pipa · acoustic guitar · electric bass
erhu · piano · erhu · bagpipe · guzheng
bassoon · bagpipe · drum · flute · cello
clarinet · acoustic guitar · erhu · pipa · guzheng

Comparison 1

CLIPSynth · CLIPSynth-Text · CLIPSynth-Hybrid · MiniLMSynth · CLIPRetriever

Comparison 2

CLIPSynth · CLIPSynth-Text · CLIPSynth-Hybrid · MiniLMSynth · CLIPRetriever

Comparison 3

CLIPSynth · CLIPSynth-Text · CLIPSynth-Hybrid · MiniLMSynth · CLIPRetriever

Comparison 4

CLIPSynth · CLIPSynth-Text · CLIPSynth-Hybrid · MiniLMSynth · CLIPRetriever

Comparison 5

CLIPSynth · CLIPSynth-Text · CLIPSynth-Hybrid · MiniLMSynth · CLIPRetriever

Example results on VGG-Sound

All examples in this section use text queries and were selected at random.

Examples

people crowd · people sniggering · goat bleating · baby laughter · sharpen knife
playing marimba, xylophone · car engine starting · playing sitar · sliding door · engine accelerating, revving, vroom
child speech, kid speaking · train horning · helicopter · male speech, man speaking · dog bow-wow
ambulance siren · playing acoustic guitar · dog barking · bowling impact · pigeon, dove cooing

Comparison 1

CLIPSynth · CLIPSynth-Text · CLIPSynth-Hybrid · MiniLMSynth · CLIPRetriever

Comparison 2

CLIPSynth · CLIPSynth-Text · CLIPSynth-Hybrid · MiniLMSynth · CLIPRetriever

Comparison 3

CLIPSynth · CLIPSynth-Text · CLIPSynth-Hybrid · MiniLMSynth · CLIPRetriever

Comparison 4

CLIPSynth · CLIPSynth-Text · CLIPSynth-Hybrid · MiniLMSynth · CLIPRetriever

Comparison 5

CLIPSynth · CLIPSynth-Text · CLIPSynth-Hybrid · MiniLMSynth · CLIPRetriever

Image-queried Synthesis Demo

All examples in this section use image queries and were selected at random.

Examples on MUSIC

Examples on VGG-Sound


Out-of-distribution Generalization Experiments

In this experiment, we examine how well the trained CLIPSynth model generalizes to unseen objects and combinatory prompts.

Experiment on CLIPSynth trained on MUSIC

Note: The model generalizes to unseen objects to some extent (viola, double bass, marimba, and bongos are not present in the MUSIC dataset). However, it fails to handle combinatory inputs, generating an "average" of the component sounds instead.
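One way to see why "average" outputs can arise: if a combinatory prompt's embedding lands near the midpoint of its components' embeddings, a decoder conditioned on it has no basis for preferring either concept. A toy illustration with random unit vectors (the names `cello` and `flute` are made-up stand-ins, not real CLIP embeddings):

```python
import numpy as np

rng = np.random.default_rng(1)

def unit(v):
    """Normalize a vector to unit length."""
    return v / np.linalg.norm(v)

# Toy stand-ins for the embeddings of two single-instrument prompts.
cello = unit(rng.standard_normal(512))
flute = unit(rng.standard_normal(512))

# Suppose the combinatory prompt embeds at the (normalized) midpoint.
combo = unit(cello + flute)

# The midpoint is exactly equally similar to both components...
sim_cello = float(cello @ combo)
sim_flute = float(flute @ combo)

# ...so a decoder conditioned on it cannot favor either concept and
# tends to produce a blend of the two sounds.
```

This is only a geometric intuition under the stated midpoint assumption; whether CLIP actually embeds combinatory prompts this way is an empirical question.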

Experiment on CLIPSynth trained on VGG-Sound


Citation

Hao-Wen Dong, Gunnar A. Sigurdsson, Chenyang Tao, Jiun-Yu Kao, Yu-Hsiang Lin, Anjali Narayan-Chen, Arpit Gupta, Tagyoung Chung, Jing Huang, Nanyun Peng, and Wenbo Zhao, “CLIPSynth: Learning Text-to-audio Synthesis from Videos using CLIP and Diffusion Models,” Proceedings of the CVPR Workshop on Sight and Sound, 2023.

@inproceedings{dong2023clipsynth,
    author = {Hao-Wen Dong and Gunnar A. Sigurdsson and Chenyang Tao and Jiun-Yu Kao and Yu-Hsiang Lin and Anjali Narayan-Chen and Arpit Gupta and Tagyoung Chung and Jing Huang and Nanyun Peng and Wenbo Zhao},
    title = {CLIPSynth: Learning Text-to-audio Synthesis from Videos using CLIP and Diffusion Models},
    booktitle = {Proceedings of the CVPR Workshop on Sight and Sound},
    year = 2023,
}