CLIPSonic

CLIPSonic: Text-to-Audio Synthesis with Unlabeled Videos and Pretrained Language-Vision Models

Hao-Wen Dong1,2*   Xiaoyu Liu1   Jordi Pons1   Gautam Bhattacharya1   Santiago Pascual1   Joan Serrà1
Taylor Berg-Kirkpatrick2   Julian McAuley2
1 Dolby Laboratories   2 University of California San Diego
* Work done during an internship at Dolby

paper demo video slides reviews


Summary of the Compared Models

Text-to-Audio Models

Model          Without text-audio pairs   Training queries   Test queries
CLIPSonic-ZS   ✓                          Image              Text
CLIPSonic-PD   ✓                          Image              Text
CLIPSonic-SD   ✗                          Image              Text
CLIP-TTA       ✗                          Text               Text
CLAP-TTA       ✗                          Text               Text

Image-to-Audio Models

For image-to-audio synthesis, CLIPSonic-IQ (ours) is compared against SpecVQGAN [1] and im2wav [2]. All three models are queried with images at both training and test time.

Best Samples

Here are some of the best samples generated by CLIPSonic-PD, our best-performing model that requires no text-audio data for training.


Listening Test Samples for Text-to-Audio Synthesis on VGGSound

We convert the labels into pseudo text in the form of “a photo of [label]”, e.g., “a photo of rapping”.
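The label-to-prompt conversion described above can be sketched in a few lines of Python. The helper name below is an illustrative choice, not from the paper, and the comment about the CLIP text encoder describes how such prompts are used conceptually rather than the authors' exact pipeline.

```python
def label_to_prompt(label: str) -> str:
    """Convert a class label into the pseudo text used as a query.

    Follows the template stated on this page: "a photo of [label]".
    (The function name is a hypothetical helper, not from the paper.)
    """
    return f"a photo of {label}"

# In CLIPSonic, such prompts would then be embedded with a pretrained
# CLIP text encoder to condition the diffusion model; here we only
# show the prompt construction itself.
labels = ["rapping", "people eating apple", "sea waves"]
prompts = [label_to_prompt(label) for label in labels]
print(prompts[0])  # a photo of rapping
```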

  rapping people eating apple sea waves
CLIPSonic-ZS
CLIPSonic-PD
CLIPSonic-SD
Ground Truth

We notice that CLIPSonic-ZS fails to generate the sound of “people eating apple”; instead, it generates a musical-sounding output. This showcases the effectiveness of the diffusion prior model used in CLIPSonic-PD and CLIPSonic-SD.

  people marching vacuum cleaner cleaning floors playing table tennis
CLIPSonic-ZS
CLIPSonic-PD
CLIPSonic-SD
Ground Truth

We notice that CLIPSonic-ZS fails to generate the sound of “table tennis”. Interestingly, it generates a sound similar to a xylophone. This showcases the effectiveness of the diffusion prior model used in CLIPSonic-PD and CLIPSonic-SD.

  playing violin fiddle playing marimba xylophone dog barking
CLIPSonic-ZS
CLIPSonic-PD
CLIPSonic-SD
Ground Truth

Listening Test Samples for Text-to-Audio Synthesis on MUSIC

We convert the labels into pseudo text in the form of “a photo of [label]”, e.g., “a photo of trumpet”.

  trumpet drums tuba
CLIPSonic-ZS
CLIPSonic-PD
CLIPSonic-SD
Ground Truth
  xylophone acoustic guitar cello
CLIPSonic-ZS
CLIPSonic-PD
CLIPSonic-SD
Ground Truth
  congas electric bass violin
CLIPSonic-ZS
CLIPSonic-PD
CLIPSonic-SD
Ground Truth

Listening Test Samples for Image-to-Audio Synthesis

The query images and audio samples for the im2wav and SpecVQGAN models are copied from the website of the im2wav paper. Note that the query images are “out-of-distribution” samples.

  01_Guitar 02_ElectricGuitar 03_Metal
CLIPSonic-IQ (ours)
SpecVQGAN
im2wav
  04_Bass 05_Police 06_Car
CLIPSonic-IQ (ours)
SpecVQGAN
im2wav
  07_Drums 08_Snare 09_Bongo
CLIPSonic-IQ (ours)
SpecVQGAN
im2wav
  10_Dog 11_Lion 12_Bird
CLIPSonic-IQ (ours)
SpecVQGAN
im2wav
  13_Rain 14_Train 15_Frog
CLIPSonic-IQ (ours)
SpecVQGAN
im2wav

More Examples on VGGSound

The videos might not play properly on Safari. Please use Chrome for the best experience.

CLIPSonic-IQ CLIPSonic-ZS
CLIPSonic-PD CLIPSonic-SD
CLIP-TTA CLAP-TTA

More Examples on MUSIC

The videos might not play properly on Safari. Please use Chrome for the best experience.

CLIPSonic-IQ CLIPSonic-ZS
CLIPSonic-PD CLIPSonic-SD
CLIP-TTA CLAP-TTA

Citation

Hao-Wen Dong, Xiaoyu Liu, Jordi Pons, Gautam Bhattacharya, Santiago Pascual, Joan Serrà, Taylor Berg-Kirkpatrick, and Julian McAuley, “CLIPSonic: Text-to-Audio Synthesis with Unlabeled Videos and Pretrained Language-Vision Models,” arXiv preprint arXiv:2306.09635, 2023.

@article{dong2023clipsonic,
    author = {Hao-Wen Dong and Xiaoyu Liu and Jordi Pons and Gautam Bhattacharya and Santiago Pascual and Joan Serrà and Taylor Berg-Kirkpatrick and Julian McAuley},
    title = {CLIPSonic: Text-to-Audio Synthesis with Unlabeled Videos and Pretrained Language-Vision Models},
    journal = {arXiv preprint arXiv:2306.09635},
    year = 2023,
}

  1. Vladimir Iashin and Esa Rahtu, “Taming Visually Guided Sound Generation,” Proc. BMVC, 2021. 

  2. Roy Sheffer and Yossi Adi, “I Hear Your True Colors: Image Guided Audio Generation,” Proc. ICASSP, 2023.