Lip Reading Sentences 3


The dataset consists of thousands of spoken sentences from TED and TEDx videos. There is no overlap between the videos used to create the test set and the ones used for the pre-train and trainval sets. The dataset statistics are given in the table below.

Set         # videos   # utterances   # word instances   Vocab
Pre-train   5,090      118,516        3.9M               51k
Trainval    4,004      31,982         358k               17k
Test        412        1,321          10k                2k

The Lip Reading Sentences 3 Languages (LRS3-Lang) dataset is an extended version of LRS3 (English-only) covering 13 different languages.


URLs and timestamps

For every sample we provide: i) the URL ('ref' entry in the text file) and frame ids of the original YouTube video it was created from, ii) the face detection bounding box for every frame, iii) the word boundary timestamps (pre-train set only). The frame numbers provided assume that the video is sampled at 25fps.
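Because the frame numbers assume 25fps, a frame id can be mapped to a time offset in the original video by dividing by 25. The following is a minimal sketch of that conversion, not part of the dataset tooling: the function names are hypothetical, and it assumes frame ids count from 0 at the start of the video.

```python
FPS = 25.0  # frame rate assumed by the dataset's frame numbers


def frame_to_seconds(frame_id: int, fps: float = FPS) -> float:
    """Convert a frame number to a time offset in seconds.

    Assumes frame 0 corresponds to time 0 (an assumption, not stated by the dataset).
    """
    return frame_id / fps


def seconds_to_frame(t: float, fps: float = FPS) -> int:
    """Convert a time offset in seconds to the nearest frame number."""
    return int(round(t * fps))


if __name__ == "__main__":
    # Example: where does frame 250 fall in the original video?
    print(frame_to_seconds(250))
    print(seconds_to_frame(10.0))
```

If a video was downloaded at a different frame rate, it would need to be resampled to 25fps (or the frame ids rescaled) before the bounding boxes line up with the frames.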

File MD5 Checksum
All sets d6a322038ce4fb2cd53742b28901070f
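
After downloading, the file can be checked against the MD5 checksum listed above. A minimal sketch using Python's standard hashlib module follows; the function names and the example file name are hypothetical and should be replaced with the actual downloaded file.

```python
import hashlib

# "All sets" checksum as listed in the table above
EXPECTED_MD5 = "d6a322038ce4fb2cd53742b28901070f"


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read fixed-size chunks so large archives do not need to fit in memory
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path: str, expected: str = EXPECTED_MD5) -> bool:
    """Check whether the file at `path` matches the expected MD5 checksum."""
    return md5_of_file(path) == expected.lower()


if __name__ == "__main__":
    import sys

    # Usage: python verify_md5.py <downloaded_file>  (script name is hypothetical)
    if len(sys.argv) == 2:
        print("OK" if verify_download(sys.argv[1]) else "MISMATCH")
```

A mismatch usually indicates an incomplete or corrupted download, in which case the file should be fetched again.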

Video files

You can request the video files here.


The LRS3 dataset is available to download for research purposes under a Creative Commons Attribution 4.0 International License. The copyright remains with the original owners of the video. A complete version of the license can be found here.


Please cite the following if you make use of the dataset.

  • LRS3-TED: a large-scale dataset for visual speech recognition
    T. Afouras, J. S. Chung, A. Zisserman
    arXiv preprint arXiv:1809.00496