A large-scale audio-visual diarisation dataset

VoxConverse is an audio-visual diarisation dataset consisting of over 50 hours of multispeaker clips of human speech, extracted from YouTube videos.


The labels for the development set can be downloaded from here.
The wav files can be downloaded from here:
File                        MD5 Checksum
Dev WAV files (Download)    2a6e07e7473d9841abb132554a698a36
Test WAV files (Download)   834558bbd9b1ffd2d4893181556ceddd
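
The MD5 checksums listed above can be used to confirm that a download completed without corruption. The following is a minimal sketch of one way to do that in Python, not an official tool of the dataset. The local file names (voxconverse_dev_wav.zip, voxconverse_test_wav.zip) and the helper names are hypothetical; adjust them to match wherever the archives were saved.

```python
import hashlib
from pathlib import Path

# Checksums copied from the table above.
EXPECTED_MD5 = {
    "dev": "2a6e07e7473d9841abb132554a698a36",
    "test": "834558bbd9b1ffd2d4893181556ceddd",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_md5):
    """Return True if the file's MD5 digest matches the expected checksum."""
    return md5_of_file(path) == expected_md5.lower()


if __name__ == "__main__":
    # Hypothetical local file names (assumption, not given by the dataset page).
    local_files = {
        "dev": Path("voxconverse_dev_wav.zip"),
        "test": Path("voxconverse_test_wav.zip"),
    }
    for split, path in local_files.items():
        if not path.exists():
            print(f"{split}: {path} not found, skipping")
            continue
        status = "OK" if verify_download(path, EXPECTED_MD5[split]) else "MISMATCH"
        print(f"{split}: {status}")
```

Reading in chunks keeps memory use flat regardless of archive size. A mismatch usually indicates an incomplete or corrupted download, in which case the file should be fetched again.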


The VoxConverse dataset is available to download for research purposes under a Creative Commons Attribution 4.0 International License. The copyright remains with the original owners of the videos.

In order to obtain videos with a large amount of overlapping speech, we used data consisting of political debates and news segments. The views and opinions expressed by speakers in the dataset are those of the individual speakers and do not necessarily reflect positions of the University of Oxford, Naver Corporation, KAIST or the authors of the paper.

We would also like to note that the distribution of identities in this dataset may not be representative of the global human population. Please be careful of unintended societal, gender, racial, linguistic and other biases when training or deploying models trained on this data.


Please cite the following if you make use of the dataset.

  • Spot the conversation: speaker diarisation in the wild
    J. S. Chung*, J. Huh*, A. Nagrani*, T. Afouras, A. Zisserman
    Interspeech, 2020