The VoxCeleb Speaker Recognition Challenge 2022
(VoxSRC-22)

Welcome to the 2022 VoxCeleb Speaker Recognition Challenge! The goal of this challenge is to probe how well current methods can recognize speakers from speech obtained 'in the wild'. The data is obtained from YouTube videos of celebrity interviews, as well as news shows, talk shows, and debates - consisting of audio from both professionally edited videos as well as more casual conversational audio in which background noise, laughter, and other artefacts are observed in a range of recording environments.

The workshop will be held in conjunction with Interspeech 2022.

The workshop page is now open. Please visit it for more information.

Timeline

The workshop deadline has been extended until 14th September 2022, 23:59:59 UTC.

July 12th Development set for verification tracks released.
July 19th Development set for diarisation tracks released.
Aug 8th Test set released and evaluation server open.
Sep 14th Deadline for submission of results; invitation to workshop speakers.
Sep 20th Deadline for submission of technical reports.
September 22nd Challenge workshop

Tracks


VoxSRC-22 will feature four tracks, including a brand new semi-supervised domain adaptation track. Tracks 1, 2 and 3 are speaker verification tracks, where the task is to determine whether two samples of speech are from the same person. Track 4 is a speaker diarisation track, where the task is to break up multi-speaker audio into homogeneous single-speaker segments, effectively solving ‘who spoke when’.

#
Description
Track 1
Fully supervised speaker verification (closed)
  • Train set : Participants can only use the VoxCeleb2 dev dataset, for which we have already released speaker labels.
  • Validation set : We provide the validation pairs in the section below.
Track 2
Fully supervised speaker verification (open)
  • Train set : Participants can use the VoxCeleb2 dev dataset and any other data except the challenge test data.
  • Validation set : We provide the validation pairs in the section below.
Track 3
Semi-supervised domain adaptation (closed)
  • Train set : Participants may train on (1) a large set of labelled data in a source domain (the VoxCeleb2 dev dataset with speaker labels), (2) a large set of unlabelled data in a target domain (a subset of the CnCeleb2 dev set without speaker labels, defined by a text file provided below in the Data section), and (3) a small set of labelled data in a target domain (a small set of CnCeleb data with speaker labels, also provided below in the Data section).
  • Validation set : We provide the validation pairs in the section below.
Track 4
Speaker diarisation (open)
  • Train set : Participants are allowed to use any data except the challenge test data.
  • Validation set : We provide both the dev and test sets of VoxConverse for validation. We have recently detected some errors in our test set labels. Please use the new version (0.3) for this competition.

New focus for the fully supervised speaker verification tracks

This year, for the fully supervised tracks (1 & 2) we focus on two challenging settings. First, we focus on how speech segments taken from the same speaker at different ages impact speaker verification systems. Secondly, we focus on how speaker verification systems perform when speech segments from different speakers have the same background noise.

New Semi-Supervised Domain Adaptation

This year, we introduce a new track (track 3), focused on semi-supervised domain adaptation. Here, we are interested in the problem of how models, pre-trained on a large set of data with labels in a source domain, can adapt to a new target domain given: (1) a large set of unlabeled data from the target domain, and (2) a small set of labeled data from the target domain.

In this track, we are interested in the domain adaptation task in speaker verification from one language in a source domain, to a different language in a target domain. Specifically, the source domain consists of mainly English-speaking utterances, and the target domain consists of Chinese-speaking utterances. Here, we use the VoxCeleb2 data as the source domain, and Cn-Celeb data as the target domain. Note that we only use a specific subset of Cn-Celeb, as defined in the following Section.

VoxCeleb2 data consists of mainly interview-style utterances, whereas Cn-Celeb consists of several different genres of utterances. In order to focus on the language domain adaptation task, we have therefore removed utterances in the target domain from the “singing”, “play”, “movie”, “advertisement”, and “drama” genres. We thank the authors of CN-Celeb for allowing the use of their dataset for the target domain in this track.

Data


Speaker verification

This year, we follow the same protocol for tracks 1 & 2 as in previous years, and we introduce a new protocol and set of data requirements for track 3.

For the speaker verification tracks, we use the VoxCeleb and CN-Celeb datasets.

Training data: There are two closed tracks (1 and 3) and one open track (track 2) for speaker verification.
  • Tracks 1 & 2: For the closed track 1, participants can only use the VoxCeleb2 dev dataset. Please refer to this website to download the dataset. For the open track 2, participants can use any public data except the challenge test data.
  • Track 3: For the closed semi-supervised domain adaptation track 3, participants can pre-train their models on the VoxCeleb2 dev dataset (the source domain) with speaker labels. To adapt their models to the target domain, participants may fine-tune on (1) a large set of unlabelled data from the target domain, for which we provide a subset of Cn-Celeb2 without speaker labels (this subset is defined in the file “Track 3 unsupervised target domain data”), and (2) a small set of labelled data from the target domain, which we provide as “Track 3 supervised target domain data” below.

Validation data:
  • Tracks 1 & 2: We provide a list of trial speech pairs from identities in the VoxCeleb1 and VoxConverse datasets. Each trial consists of two single-speaker speech segments of variable length. Unlike previous challenges, this year we do not have multilingual or out-of-domain focuses. Instead, we focus on both the impact of the speaker's age on their speech segments and the impact of shared background noise between speech segments from different speakers. Please note that filenames starting with 'id1' are from VoxCeleb1, and filenames starting with 'VoxSRC2022_dev/' are from the additional wavfiles (cropped from VoxConverse) provided below. You need to download VoxCeleb1 from this page.
  • Track 3: For track 3, we provide a list of trial speech pairs from identities in the target domain. Each trial consists of two single-speaker speech segments, of variable length.

Note: We recently found some duplicates (1236 pairs) in the Track 1 & 2 validation trial pairs. The fixed trial pairs have been uploaded as "Track 1 & 2 validation trial pairs(fixed)", so please download them.


File MD5 Checksum
VoxCeleb1 (required for Track 1 & 2 validation set) Download
Track 1 & 2 additional validation wavfiles Download 763be4988cea5ff0eea39081d881af1f
Track 1 & 2 validation trial pairs Download f70fd8138deb8312403dc35f802b0548
Track 1 & 2 validation trial pairs(fixed) Download c2c0bf75450ddf7fbeb5aca07ebc70ae
Track 3 unsupervised target domain data Download b0157d5cb961ecb1f5f617625fb843a1
Track 3 supervised target domain data Download 57170ba6c8c5223be0cefc6ab1b43e5f
Track 3 validation wavfiles Download 50fccc3315cf7b18d6575350d8fb043d
Track 3 validation trial pairs Download 97f71af121f620363f86070089adad02
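
The MD5 checksums in the table can be used to confirm that a download completed without corruption. The following is a minimal sketch using only the Python standard library; the function names and the local filename in the usage note are assumptions for illustration, not part of the challenge toolkit.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large archives do not fill memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_md5):
    """Return True if the file's MD5 matches the listed checksum."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

For example, `verify_download("additional_val_wavs.zip", "763be4988cea5ff0eea39081d881af1f")` would check the additional validation wavfiles, where `additional_val_wavs.zip` is a hypothetical name for wherever the archive was saved.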


Test data: The test set consists of a list of trial pairs and anonymized speech wavfiles. Below are the links to download both the trial list and speech segments.


File MD5 Checksum
Track 1 & 2 test wavfiles Download 92b469c92dedaa5cadeddcbc65d47be9
Track 1 & 2 test trial pairs Download 3ae427a650dc02303b7708ea520ddcf2
Track 3 test wavfiles Download b34afbdfdcb84f9b4af1c888b51222a3
Track 3 test trial pairs Download 24eff8237f06f1a89e535e9478f2061d

Speaker diarisation

Training data: Participants can use any data except the challenge test data.

Validation data: We provide both the dev and test sets of the VoxConverse (ver 0.3) dataset.

Test data: We provide 360 wavfiles for the test set. Please note that you have to submit a single RTTM file containing all predicted segments for our test data.

File MD5 Checksum
Track 4 test wavfiles Download ed940a2232461126d490de677ae15933
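
Since predictions for all test recordings go into one RTTM file, a small helper can merge per-recording segments. The sketch below assumes the standard ten-field RTTM `SPEAKER` line and a hypothetical `predictions` dictionary mapping each recording id to `(onset, duration, speaker)` tuples; the exact naming expected by the evaluation server should be checked against the submission instructions.

```python
def rttm_lines(predictions):
    """Build RTTM SPEAKER lines from {recording_id: [(onset, duration, speaker), ...]}."""
    lines = []
    for recording_id in sorted(predictions):
        # Sort segments by onset so the file is easy to inspect.
        for onset, duration, speaker in sorted(predictions[recording_id]):
            lines.append(
                f"SPEAKER {recording_id} 1 {onset:.3f} {duration:.3f} "
                f"<NA> <NA> {speaker} <NA> <NA>"
            )
    return lines


def write_rttm(predictions, path):
    """Write all predicted segments for all recordings into one RTTM file."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(rttm_lines(predictions)) + "\n")
```

Times are in seconds, and the channel field is fixed to 1 here as an assumption for single-channel audio.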

Evaluation Metrics


Speaker Verification

For the Speaker Verification tracks, we will display both the Equal Error Rate (EER) and the Minimum Detection Cost (CDet). For tracks 1 and 2, the primary metric for the challenge will be the Detection Cost, and the final ranking of the leaderboard will be determined using this score alone. For track 3, the primary metric is EER, as this is a more forgiving metric.

Equal Error Rate
The EER is the error rate at the decision threshold where a system's false acceptance rate (FAR) and false rejection rate (FRR) are equal.
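
As an illustration, the EER can be approximated by sweeping a threshold over the trial scores and taking the point where FAR and FRR are closest. This is a minimal sketch, not the development toolkit's code; it assumes higher scores mean "same speaker" and that labels are 1 for target trials and 0 for non-target trials.

```python
def compute_eer(scores, labels):
    """Approximate the equal error rate from trial scores and 0/1 labels."""
    n_target = sum(labels)
    n_nontarget = len(labels) - n_target
    if n_target == 0 or n_nontarget == 0:
        raise ValueError("need at least one target and one non-target trial")

    # Sweep the threshold from above the highest score downwards.
    pairs = sorted(zip(scores, labels), key=lambda p: -p[0])
    false_accepts, misses = 0, n_target  # start: every trial is rejected
    best_gap, eer = 1.0, 0.5
    for i, (score, label) in enumerate(pairs):
        if label == 1:
            misses -= 1          # this target trial is now accepted
        else:
            false_accepts += 1   # this non-target trial is now accepted
        if i + 1 < len(pairs) and pairs[i + 1][0] == score:
            continue             # tied scores share a single threshold
        far = false_accepts / n_nontarget
        frr = misses / n_target
        if abs(far - frr) < best_gap:
            best_gap = abs(far - frr)
            eer = (far + frr) / 2  # average the two rates at the crossing
    return eer
```

With a finite trial list FAR and FRR may never be exactly equal, so the sketch reports their average at the closest point; other implementations may interpolate instead.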

Minimum Detection Cost
Compared to the equal error rate, which assigns equal weight to false negatives and false positives, this cost is usually used to assess performance in settings where achieving a low false positive rate is more important than achieving a low false negative rate. We follow the procedure outlined in Sec 3.1 of the NIST 2018 Speaker Recognition Evaluation Plan for the AfV trials. To avoid ambiguity, we use the following parameters: C_Miss = 1, C_FalseAlarm = 1, and P_Target = 0.05.
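
The detection cost at a threshold is C_Miss × P_Target × P_Miss + C_FalseAlarm × (1 − P_Target) × P_FalseAlarm, and the minimum is taken over all thresholds. The sketch below is an assumption-laden illustration rather than the official scorer: it assumes higher scores mean "same speaker", labels of 1 for target trials, and normalisation by the cost of the best trivial system, as in the NIST plan.

```python
def compute_min_dcf(scores, labels, p_target=0.05, c_miss=1.0, c_fa=1.0):
    """Normalised minimum detection cost over all score thresholds."""
    n_target = sum(labels)
    n_nontarget = len(labels) - n_target
    if n_target == 0 or n_nontarget == 0:
        raise ValueError("need at least one target and one non-target trial")

    pairs = sorted(zip(scores, labels), key=lambda p: -p[0])
    false_accepts, misses = 0, n_target
    # Threshold above every score: all trials rejected, so P_Miss = 1.
    min_cost = c_miss * p_target
    for i, (score, label) in enumerate(pairs):
        if label == 1:
            misses -= 1
        else:
            false_accepts += 1
        if i + 1 < len(pairs) and pairs[i + 1][0] == score:
            continue  # tied scores share a single threshold
        p_miss = misses / n_target
        p_fa = false_accepts / n_nontarget
        cost = c_miss * p_target * p_miss + c_fa * (1.0 - p_target) * p_fa
        min_cost = min(min_cost, cost)

    # Normalise by the cost of always rejecting or always accepting.
    c_default = min(c_miss * p_target, c_fa * (1.0 - p_target))
    return min_cost / c_default
```

With P_Target = 0.05, a false alarm is weighted 19 times more heavily than a miss, which is why this metric favours systems with few false acceptances.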

Speaker Diarisation

For the Speaker Diarisation track, we will display both the Diarisation Error Rate (DER) and the Jaccard Error Rate (JER), but the leaderboard will be ranked using the Diarisation Error Rate (DER) only.

Diarisation Error Rate
The Diarisation Error Rate (DER) is the sum of
1. speaker error - percentage of scored time for which the wrong speaker id is assigned within a speech region.
2. false alarm speech - percentage of scored time for which a nonspeech region is incorrectly marked as containing speech
3. missed speech - percentage of scored time for which a speech region is incorrectly marked as not containing speech.

We use a collar of 0.25 seconds and include overlapping speech in the scoring. For more details, consult section 6.1 of the NIST RT-09 evaluation plan.
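
As a worked illustration of the sum above, the following sketch computes DER from the three error durations. The function name and the example durations are made up for illustration; real scoring also requires the speaker mapping and the 0.25 second collar, which this arithmetic does not perform.

```python
def diarisation_error_rate(speaker_error, false_alarm, missed_speech, scored_time):
    """Return DER as a fraction, given error durations and scored time in seconds."""
    if scored_time <= 0:
        raise ValueError("scored_time must be positive")
    return (speaker_error + false_alarm + missed_speech) / scored_time


# Hypothetical recording: 600 s of scored time with 18 s of speaker error,
# 12 s of false alarm speech and 30 s of missed speech.
der = diarisation_error_rate(18.0, 12.0, 30.0, 600.0)
print(f"DER = {100 * der:.1f}%")
```

Here the three error terms total 60 s out of 600 s of scored time, giving a DER of 10%.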

Jaccard Error Rate
We also report the Jaccard error rate (JER), a metric introduced for the DIHARD II challenge that is based on the Jaccard index. The Jaccard index is a similarity measure typically used to evaluate the output of image segmentation systems and is defined as the ratio between the intersection and union of two segmentations. To compute Jaccard error rate, an optimal mapping between reference and system speakers is determined and for each pair the Jaccard index of their segmentations is computed. The Jaccard error rate is then 1 minus the average of these scores. For more details please consult Sec 3 of the DIHARD Challenge Report.
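
The computation described above can be sketched on a toy example. This illustration assumes each speaker's segmentation is represented as a set of frame indices, averages over reference speakers, and finds the mapping by brute force over permutations, which is only practical for a handful of speakers; practical scorers typically use the Hungarian algorithm instead. All names are hypothetical.

```python
from itertools import permutations


def jaccard_index(a, b):
    """Intersection over union of two sets of frame indices."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def jaccard_error_rate(reference, system):
    """JER for {speaker: set_of_frames} dicts, using a brute-force mapping."""
    ref_ids = list(reference)
    if not ref_ids:
        raise ValueError("reference must contain at least one speaker")
    # Pad with None so any reference speaker may be left unmatched.
    candidates = list(system) + [None] * len(ref_ids)
    best_total = 0.0
    for assignment in permutations(candidates, len(ref_ids)):
        total = sum(
            jaccard_index(reference[r], system[s])
            for r, s in zip(ref_ids, assignment)
            if s is not None  # unmatched speakers contribute an index of 0
        )
        best_total = max(best_total, total)
    return 1.0 - best_total / len(ref_ids)
```

For instance, with reference speakers covering frames 0 to 9 and 10 to 19, and system speakers covering frames 0 to 7 and 8 to 19, the paired Jaccard indices are 8/10 and 10/12, so the JER is 1 minus their average.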


Code for computing all metrics on the validation data has been provided in the development toolkit.

Challenge registration

All four tracks are hosted on the CodaLab platform. You need a CodaLab account to register, so please create one if you don't have one. Any researcher, whether in academia or industry, can participate in our challenge, but we only accept institutional email addresses for registration. Please follow the instructions on each challenge website for submission.

The CodaLab evaluation servers are now active. Please visit the links below to participate.

Previous Challenges

Details of the previous challenges can be found below. You can also find the slides and presentation videos of the winners on the workshop websites.

Challenge Links
VoxSRC-19 challenge / workshop
VoxSRC-20 challenge / workshop
VoxSRC-21 challenge / workshop

FAQs

Q. Who is allowed to participate?
A. Any researcher, whether in academia or industry, is invited to participate in VoxSRC. We only request a valid official email address associated with an institution for registration, once the registration system opens. This ensures we limit the number of submissions per team.

Q: Do I need to use the name of my institution or my real name as the team name for a submission?
A: No, you do not have to. The name of the CodaLab user (or the Team name, if you have set up one in CodaLab) that uploads the submission will be used in the public leaderboard. Hence if you do not want your details to be public, you should anonymise if appropriate. You must select a team name before the server's closing time.

Q: For the semi-supervised track 3, can I train on all of CnCeleb?
A: No. For track 3, participants may only train on (1) a subset of CnCeleb without labels (we provide the subset under the name “Track_3_unsupervised_target_domain_data.txt” in the “data” section), and (2) a small set of CnCeleb with labels (we provide this set under the name “Track 3 supervised target domain data” in the “data” section).

Q: For the semi-supervised track 3, can I use my model that I trained for the closed track 1?
A: Yes. For track 3, participants are allowed to train on the VoxCeleb2 dev dataset. So participants can use their model that was trained for track 1.

Q: For the semi-supervised track 3, can I use the CnCeleb validation set?
A: No. For track 3, participants can only use the provided validation set.

Q. Can I participate in only some tracks?
A. Yes, you can participate in as many tracks as you like and be considered for each one independently.

Q: How many submissions can I make?
A: You can only make 1 submission per day. In total, you can make only 10 submissions to the test set for each track.

Q: Can I train on other external datasets (public, or not)?
A: Only for the OPEN tracks. Not for the CLOSED tracks.

Q: Can I use data augmentation?
A: Yes, you can use any kind of noise or music, as long as you are not training on additional speech data, for the CLOSED tracks. You may also use the MUSAN noise dataset as augmentation for the CLOSED tracks. For the OPEN track, you can train on any data you see fit.

Q. Can I participate in the challenge but not submit a report describing my method?
A. We do not allow that option. Entries to the challenge will only be considered if a technical report is submitted on time. This should not affect later publications of your method if you restrict your report to 2 pages including references. You can still submit to the leaderboard, however, even if you do not submit a technical report.

Q. Will the technical report submitted to this workshop be archived by Interspeech 2022?
A. No. We shall use the papers to select some authors to present their work at the workshop.

Q. Will there be prizes for the winners?
A. Yes, there will be cash prizes for the top 3 on the leaderboard for each track.

Q. For the CLOSED condition, can I use the validation set for training anything, e.g. the PLDA parameters?
A. No, for the CLOSED condition you can use the validation set only to tune user-defined hyperparameters, e.g. selecting which convolutional model to use.

Q. For the CLOSED conditions, what can I use as the validation set?
A. For the closed conditions, participants may only use the provided pairs for this year's challenge, or the VoxCeleb1 pairs. These must strictly NOT be used for training. It is beneficial for participants to use this year's provided validation pairs, as their distribution matches that of the hidden test pairs.

Q. What kind of supervision can I use when training without labels in the semi-supervised track?
A. Self-supervision is an increasingly popular field of machine learning which does not use manually labelled training data for a particular task. The supervision for training instead comes from the data itself, for example from the future frames of a video or from another modality, such as faces.

Q. For the semi-supervised track, when I am training on the large set of target domain data without labels, can I use the total number of speakers in the CnCeleb2 dev set as a hyperparameter?
A. No, you cannot use any speaker identity information at all. You cannot use the number of speakers in any way, e.g. to determine the number of clusters for a clustering algorithm.

Q. What if I have an additional question about the competition?
A. If you are registered in the CodaLab competition, please post your question in the competition forum (rather than contacting the organisers directly by e-mail) and we will answer it as soon as possible. The reason for this approach is that others may have similar questions: using the forum ensures that the answer is useful for everyone. If you would rather ask your question before registering, please follow the procedure in the Organisers section below.

Organisers

Jaesung Huh, VGG, University of Oxford,
Andrew Brown, VGG, University of Oxford,
Arsha Nagrani, Google Research,
Joon Son Chung, KAIST, South Korea,
Andrew Zisserman, VGG, University of Oxford,
Daniel Garcia-Romero, AWS AI,
Jee-Weon Jung, Naver Corporation, South Korea

Advisors

Mitchell McLaren, Speech Technology and Research Laboratory, SRI International, CA,
Douglas A Reynolds, Lincoln Laboratory, MIT.

Please contact jaesung[at]robots[dot]ox[dot]ac[dot]uk or abrown[at]robots[dot]ox[dot]ac[dot]uk if you have any queries, or if you would be interested in sponsoring this challenge.

We thank the authors of CN-Celeb for their help and support.

Sponsors

VoxSRC is proudly sponsored by Naver/Line.