CLIPSonic-ZS (zero-shot modality transfer): Our proposed CLIPSonic model queried with CLIP-text embeddings in a zero-shot setting
CLIPSonic-PD (pretrained diffusion prior): CLIPSonic queried with CLIP-image embeddings generated by a pretrained diffusion prior model
CLIPSonic-SD (supervised diffusion prior): CLIPSonic queried with CLIP-image embeddings generated by a diffusion prior model trained on the target dataset (a code sketch of how each variant is queried at test time follows the table below)
CLIP-TTA: A baseline model that synthesizes a mel spectrogram from a CLIP-text embedding
CLAP-TTA: A baseline model that synthesizes a mel spectrogram from a CLAP-text embedding
| Model | Without text-audio pairs | Training queries | Test queries |
|---|---|---|---|
| CLIPSonic-ZS | ✓ | Image | Text |
| CLIPSonic-PD | ✓ | Image | Text |
| CLIPSonic-SD | ✗ | Image | Text |
| CLIP-TTA | ✗ | Text | Text |
| CLAP-TTA | ✗ | Text | Text |
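To make the querying setup above concrete, here is a minimal sketch of how the three CLIPSonic variants might be queried at test time. Only the `clip.load`, `clip.tokenize`, and `encode_text` calls follow the real openai/CLIP API; `spectrogram_diffusion`, `diffusion_prior`, and `query_clipsonic` are hypothetical names standing in for the trained mel-spectrogram diffusion model, the diffusion prior, and the query routine.

```python
# Minimal sketch (not the official implementation) of test-time querying for
# the CLIPSonic variants. `spectrogram_diffusion` and `diffusion_prior` are
# hypothetical stand-ins; only the CLIP calls follow the real openai/CLIP API.
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _preprocess = clip.load("ViT-B/32", device=device)

@torch.no_grad()
def query_clipsonic(text, variant, spectrogram_diffusion, diffusion_prior=None):
    """Return a mel spectrogram conditioned on a CLIP embedding of `text`."""
    tokens = clip.tokenize([text]).to(device)
    text_emb = clip_model.encode_text(tokens)  # CLIP-text embedding

    if variant == "ZS":
        # Zero-shot modality transfer: condition directly on the text
        # embedding, although training only ever used CLIP-image embeddings.
        cond = text_emb
    elif variant in ("PD", "SD"):
        # Translate the text embedding into a predicted CLIP-image embedding
        # with a diffusion prior (pretrained for PD, dataset-trained for SD).
        cond = diffusion_prior(text_emb)
    else:
        raise ValueError(f"unknown variant: {variant}")

    # The mel-spectrogram diffusion model then samples a spectrogram
    # conditioned on the chosen embedding (vocoding to waveform is omitted).
    return spectrogram_diffusion.sample(condition=cond)
```

The only difference between the variants is which embedding reaches the diffusion model: the CLIP-text embedding itself (ZS) or a CLIP-image embedding predicted from it by a diffusion prior (PD/SD).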
Image-to-Audio Models
CLIPSonic-IQ (image-queried): Our proposed CLIPSonic model queried with CLIP-image embeddings
SpecVQGAN: An image-to-audio synthesis model proposed by Iashin and Rahtu (2021) [1]
im2wav: A state-of-the-art image-to-audio synthesis model proposed by Sheffer and Adi (2023) [2]
Best Samples
Here are some of the best samples generated by CLIPSonic-PD, our best-performing model that requires no text-audio data for training.
Listening Test Samples for Text-to-Audio Synthesis on VGGSound
We convert the labels into pseudo text in the form of “a photo of [label]”, e.g., “a photo of rapping”.
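As a small illustration, the conversion amounts to the following; the helper name `label_to_prompt` is ours, introduced only for illustration.

```python
# Sketch of the pseudo-text prompt construction described above;
# `label_to_prompt` is a hypothetical helper name, not from the paper's code.
def label_to_prompt(label: str) -> str:
    return f"a photo of {label}"

print(label_to_prompt("rapping"))  # a photo of rapping
```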
[Audio samples for the queries “rapping”, “people eating apple”, and “sea waves”, from CLIPSonic-ZS, CLIPSonic-PD, CLIPSonic-SD, and the ground truth.]
We notice that CLIPSonic-ZS fails to generate the sound of “people eating apple”. Instead, it generates a musical sound. This showcases the effectiveness of the diffusion prior model used in CLIPSonic-PD and CLIPSonic-SD.
[Audio samples for the queries “people marching”, “vacuum cleaner cleaning floors”, and “playing table tennis”, from CLIPSonic-ZS, CLIPSonic-PD, CLIPSonic-SD, and the ground truth.]
We notice that CLIPSonic-ZS fails to generate the sound of “table tennis”. Interestingly, it generates a sound similar to a xylophone. This showcases the effectiveness of the diffusion prior model used in CLIPSonic-PD and CLIPSonic-SD.
[Audio samples for the queries “playing violin fiddle”, “playing marimba xylophone”, and “dog barking”, from CLIPSonic-ZS, CLIPSonic-PD, CLIPSonic-SD, and the ground truth.]
Listening Test Samples for Text-to-Audio Synthesis on MUSIC
We convert the labels into pseudo text in the form of “a photo of [label]”, e.g., “a photo of trumpet”.
[Audio samples for the queries “trumpet”, “drums”, “tuba”, “xylophone”, “acoustic guitar”, “cello”, “congas”, “electric bass”, and “violin”, from CLIPSonic-ZS, CLIPSonic-PD, CLIPSonic-SD, and the ground truth.]
Listening Test Samples for Image-to-Audio Synthesis
The query images and audio samples for the im2wav and SpecVQGAN models are copied from the website of the im2wav paper. Note that the query images are “out-of-distribution” samples.
[Five out-of-distribution query images, each with audio samples from CLIPSonic-IQ (ours), SpecVQGAN, and im2wav.]
More Examples on VGGSound
The videos might not play properly on Safari. Please use Chrome for the best experience.
[Videos with audio generated by CLIPSonic-IQ, CLIPSonic-ZS, CLIPSonic-PD, CLIPSonic-SD, CLIP-TTA, and CLAP-TTA.]
More Examples on MUSIC
The videos might not play properly on Safari. Please use Chrome for the best experience.
[Videos with audio generated by CLIPSonic-IQ, CLIPSonic-ZS, CLIPSonic-PD, CLIPSonic-SD, CLIP-TTA, and CLAP-TTA.]
Citation
Hao-Wen Dong, Xiaoyu Liu, Jordi Pons, Gautam Bhattacharya, Santiago Pascual, Joan Serrà, Taylor Berg-Kirkpatrick, and Julian McAuley, “CLIPSonic: Text-to-Audio Synthesis with Unlabeled Videos and Pretrained Language-Vision Models,” arXiv preprint arXiv:2306.09635, 2023.
@article{dong2023clipsonic,
  author  = {Hao-Wen Dong and Xiaoyu Liu and Jordi Pons and Gautam Bhattacharya and Santiago Pascual and Joan Serrà and Taylor Berg-Kirkpatrick and Julian McAuley},
  title   = {CLIPSonic: Text-to-Audio Synthesis with Unlabeled Videos and Pretrained Language-Vision Models},
  journal = {arXiv preprint arXiv:2306.09635},
  year    = 2023,
}
[1] Vladimir Iashin and Esa Rahtu, “Taming Visually Guided Sound Generation,” Proc. BMVC, 2021.
[2] Roy Sheffer and Yossi Adi, “I Hear Your True Colors: Image Guided Audio Generation,” Proc. ICASSP, 2023.