CLIPSonic-ZS (zero-shot modality transfer): Our proposed CLIPSonic model queried with CLIP-text embeddings in a zero-shot setting
CLIPSonic-PD (pretrained diffusion prior): CLIPSonic queried with CLIP-image embeddings generated by a pretrained diffusion prior model
CLIPSonic-SD (supervised diffusion prior): CLIPSonic queried with CLIP-image embeddings generated by a diffusion prior model trained on the target dataset (a code sketch of how each variant is queried at test time follows the table below)
CLIP-TTA: A baseline model that synthesizes a mel spectrogram from a CLIP-text embedding
CLAP-TTA: A baseline model that synthesizes a mel spectrogram from a CLAP-text embedding
| Model | Without text-audio pairs | Training queries | Test queries |
|---|---|---|---|
| CLIPSonic-ZS | ✓ | Image | Text |
| CLIPSonic-PD | ✓ | Image | Text |
| CLIPSonic-SD | ✗ | Image | Text |
| CLIP-TTA | ✗ | Text | Text |
| CLAP-TTA | ✗ | Text | Text |
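To make the querying setup above concrete, here is a minimal sketch of how the three CLIPSonic variants might be queried at test time. Only the `clip.load`, `clip.tokenize`, and `encode_text` calls follow the real openai/CLIP API; `spectrogram_diffusion`, `diffusion_prior`, and `query_clipsonic` are hypothetical names standing in for the trained mel-spectrogram diffusion model, the diffusion prior, and the query routine.

```python
# Minimal sketch (not the official implementation) of test-time querying for
# the CLIPSonic variants. `spectrogram_diffusion` and `diffusion_prior` are
# hypothetical stand-ins; only the CLIP calls follow the real openai/CLIP API.
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _preprocess = clip.load("ViT-B/32", device=device)

@torch.no_grad()
def query_clipsonic(text, variant, spectrogram_diffusion, diffusion_prior=None):
    """Return a mel spectrogram conditioned on a CLIP embedding of `text`."""
    tokens = clip.tokenize([text]).to(device)
    text_emb = clip_model.encode_text(tokens)  # CLIP-text embedding

    if variant == "ZS":
        # Zero-shot modality transfer: condition directly on the text
        # embedding, although training only ever used CLIP-image embeddings.
        cond = text_emb
    elif variant in ("PD", "SD"):
        # Translate the text embedding into a predicted CLIP-image embedding
        # with a diffusion prior (pretrained for PD, dataset-trained for SD).
        cond = diffusion_prior(text_emb)
    else:
        raise ValueError(f"unknown variant: {variant}")

    # The mel-spectrogram diffusion model then samples a spectrogram
    # conditioned on the chosen embedding (vocoding to waveform is omitted).
    return spectrogram_diffusion.sample(condition=cond)
```

The only difference between the variants is which embedding reaches the diffusion model: the CLIP-text embedding itself (ZS) or a CLIP-image embedding predicted from it by a diffusion prior (PD/SD).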
Image-to-Audio Models
CLIPSonic-IQ (image-queried): Our proposed CLIPSonic model queried with CLIP-image embeddings
SpecVQGAN: An image-to-audio synthesis model proposed by Iashin and Rahtu (2021) [1]
im2wav: A state-of-the-art image-to-audio synthesis model proposed by Sheffer and Adi (2023) [2]
Best Samples
Here are some of the best samples generated by CLIPSonic-PD, our best-performing model that requires no text-audio data for training.
Listening Test Samples for Text-to-Audio Synthesis on VGGSound
We convert the labels into pseudo text in the form of “a photo of [label]”, e.g., “a photo of rapping”.
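As a small illustration, the conversion amounts to the following; the helper name `label_to_prompt` is ours, introduced only for illustration.

```python
# Sketch of the pseudo-text prompt construction described above;
# `label_to_prompt` is a hypothetical helper name, not from the paper's code.
def label_to_prompt(label: str) -> str:
    return f"a photo of {label}"

print(label_to_prompt("rapping"))  # a photo of rapping
```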
[Audio samples for the queries “rapping”, “people eating apple”, and “sea waves”, from CLIPSonic-ZS, CLIPSonic-PD, CLIPSonic-SD, and the ground truth.]
We notice that CLIPSonic-ZS fails to generate the sound of “people eating apple”. Instead, it generates a musical sound. This showcases the effectiveness of the diffusion prior model used in CLIPSonic-PD and CLIPSonic-SD.
[Audio samples for the queries “people marching”, “vacuum cleaner cleaning floors”, and “playing table tennis”, from CLIPSonic-ZS, CLIPSonic-PD, CLIPSonic-SD, and the ground truth.]
We notice that CLIPSonic-ZS fails to generate the sound of “table tennis”. Interestingly, it generates a sound similar to a xylophone. This showcases the effectiveness of the diffusion prior model used in CLIPSonic-PD and CLIPSonic-SD.
[Audio samples for the queries “playing violin fiddle”, “playing marimba xylophone”, and “dog barking”, from CLIPSonic-ZS, CLIPSonic-PD, CLIPSonic-SD, and the ground truth.]
Listening Test Samples for Text-to-Audio Synthesis on MUSIC
We convert the labels into pseudo text in the form of “a photo of [label]”, e.g., “a photo of trumpet”.
[Audio samples for the queries “trumpet”, “drums”, “tuba”, “xylophone”, “acoustic guitar”, “cello”, “congas”, “electric bass”, and “violin”, from CLIPSonic-ZS, CLIPSonic-PD, CLIPSonic-SD, and the ground truth.]
Listening Test Samples for Image-to-Audio Synthesis
The query images and audio samples for the im2wav and SpecVQGAN models are copied from the website of the im2wav paper. Note that the query images are “out-of-distribution” samples.
[Five out-of-distribution query images, each with audio samples from CLIPSonic-IQ (ours), SpecVQGAN, and im2wav.]
More Examples on VGGSound
The videos might not play properly on Safari. Please use Chrome for the best experience.
[Videos with audio generated by CLIPSonic-IQ, CLIPSonic-ZS, CLIPSonic-PD, CLIPSonic-SD, CLIP-TTA, and CLAP-TTA.]
More Examples on MUSIC
The videos might not play properly on Safari. Please use Chrome for the best experience.
[Videos with audio generated by CLIPSonic-IQ, CLIPSonic-ZS, CLIPSonic-PD, CLIPSonic-SD, CLIP-TTA, and CLAP-TTA.]
Citation
Hao-Wen Dong, Xiaoyu Liu, Jordi Pons, Gautam Bhattacharya, Santiago Pascual, Joan Serrà, Taylor Berg-Kirkpatrick, and Julian McAuley, “CLIPSonic: Text-to-Audio Synthesis with Unlabeled Videos and Pretrained Language-Vision Models,” arXiv preprint arXiv:2306.09635, 2023.
@article{dong2023clipsonic,
  author  = {Hao-Wen Dong and Xiaoyu Liu and Jordi Pons and Gautam Bhattacharya and Santiago Pascual and Joan Serrà and Taylor Berg-Kirkpatrick and Julian McAuley},
  title   = {CLIPSonic: Text-to-Audio Synthesis with Unlabeled Videos and Pretrained Language-Vision Models},
  journal = {arXiv preprint arXiv:2306.09635},
  year    = 2023,
}
[1] Vladimir Iashin and Esa Rahtu, “Taming Visually Guided Sound Generation,” Proc. BMVC, 2021.
[2] Roy Sheffer and Yossi Adi, “I Hear Your True Colors: Image Guided Audio Generation,” Proc. ICASSP, 2023.