VoxMovies is an audio dataset containing utterances sourced from movies, with varying emotion, accents and background noise.
To benchmark the performance of speaker recognition systems on this entirely new domain, VoxMovies contains a number of domain adaptation evaluation sets.
VoxMovies contains speech from speakers in VoxCeleb1 and VoxCeleb2 (speaker recognition training datasets), so that domain change within the same identity can be investigated.
VoxMovies is sourced from key moments in a wide variety of movies from the Condensed Movies dataset. These movies cover many different genres such as comedy, action, romance and horror.
VoxMovies consists of audio clips. On average each identity has utterances from 2.7 different movies. Variation in emotion and background noise is therefore seen within each identity, as well as across identities.
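As an illustration of how a per-identity statistic such as the average number of movies can be computed, the following is a minimal sketch. It assumes the metadata is available as (identity, movie) pairs, one per utterance. The function name, the record layout and the toy identifiers are hypothetical and do not reflect the actual VoxMovies file format.

```python
from collections import defaultdict


def mean_movies_per_identity(records):
    """Average number of distinct movies per identity.

    `records` is an iterable of (identity_id, movie_id) pairs, one pair
    per utterance. This layout is an assumption made for illustration.
    """
    movies_by_identity = defaultdict(set)
    for identity_id, movie_id in records:
        # A set keeps each movie once, however many utterances it contributes.
        movies_by_identity[identity_id].add(movie_id)
    if not movies_by_identity:
        raise ValueError("no records given")
    total_movies = sum(len(movies) for movies in movies_by_identity.values())
    return total_movies / len(movies_by_identity)


# Made-up toy records, not taken from the dataset.
toy_records = [
    ("id_A", "movie_1"), ("id_A", "movie_1"), ("id_A", "movie_2"),
    ("id_B", "movie_3"),
    ("id_C", "movie_1"), ("id_C", "movie_4"), ("id_C", "movie_5"),
]

print(mean_movies_per_identity(toy_records))
```

In the toy records the three identities appear in 2, 1 and 3 distinct movies respectively, so the average is 2.0. Repeated utterances from the same movie do not change the count.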
Movie genres featured in VoxMovies
The VoxMovies dataset is available to download for commercial/research purposes under a Creative Commons Attribution 4.0 International License. The copyright remains with the original owners of the video. A complete version of the license can be found here.
Caution: We note that the distribution of identities in the VoxMovies dataset may not be representative of the global human population. Please be careful of unintended societal, gender, racial and other biases when training or deploying models trained or evaluated on this data.
Playing a Part: Speaker Verification at the Movies
A. Brown*, J. Huh*, A. Nagrani*, J. S. Chung, A. Zisserman
International Conference on Acoustics, Speech and Signal Processing, 2021