VoxCeleb contains speech from speakers spanning a wide range of ethnicities, accents, professions and ages.
1 million+ utterances
All speaking face-tracks are captured "in the wild", with background chatter, laughter, overlapping speech, pose variation and different lighting conditions.
VoxCeleb consists of both audio and video. Each segment is at least 3 seconds long.
We provide URLs for each YouTube video and timestamps for utterances. The frame numbers provided assume that the video is saved at 25 fps.
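Because frame numbers are tied to a 25 fps frame rate, converting between frame numbers and time offsets is simple arithmetic. The following is a minimal sketch of that conversion, not part of the dataset tooling: the function names are hypothetical, and it assumes frame numbering starts at 0 and that segment ranges are inclusive, neither of which is stated above.

```python
FPS = 25  # frame rate the provided frame numbers assume


def frame_to_seconds(frame: int, fps: int = FPS) -> float:
    """Convert a frame number to a time offset in seconds.

    Assumption: frame numbering starts at 0.
    """
    if frame < 0:
        raise ValueError("frame number must be non-negative")
    return frame / fps


def seconds_to_frame(seconds: float, fps: int = FPS) -> int:
    """Convert a time offset in seconds to the nearest frame number."""
    if seconds < 0:
        raise ValueError("time offset must be non-negative")
    return round(seconds * fps)


def segment_duration(first_frame: int, last_frame: int, fps: int = FPS) -> float:
    """Duration in seconds of a segment, treating both end frames as included."""
    if last_frame < first_frame:
        raise ValueError("last_frame must not precede first_frame")
    return (last_frame - first_frame + 1) / fps
```

Under these assumptions, a segment that meets the 3-second minimum spans at least 75 frames. If the video is saved at a different frame rate, the provided frame numbers no longer line up with the video, so it is re-encoded to 25 fps or the frame numbers are rescaled first.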
The VoxCeleb dataset is available to download for research purposes under a Creative Commons Attribution 4.0 International License. The copyright remains with the original owners of the video. A complete version of the license can be found here.
VoxCeleb: a large-scale speaker identification dataset
A. Nagrani*, J. S. Chung*, A. Zisserman
VoxCeleb2: Deep Speaker Recognition
J. S. Chung*, A. Nagrani*, A. Zisserman
VoxCeleb: Large-scale speaker verification in the wild
A. Nagrani*, J. S. Chung*, W. Xie, A. Zisserman
Computer Speech and Language, 2019