The 8th International Conference "Distributed Computing and Grid-technologies in Science and Education" (GRID 2018)

Name: The 8th International Conference "Distributed Computing and Grid-technologies in Science and Education" (GRID 2018)
Start: 2018-09-10T08:00:00+03:00
End: 2018-09-14T19:05:00+03:00

10–14 Sept 2018

Europe/Moscow timezone

Support

grid2018@jinr.ru

Building corpora of transcribed speech from open access sources

13 Sept 2018, 14:45

15m

406A

Sectional reports: 11. Big data Analytics, Machine learning

Anna Shaleva (Saint-Petersburg State University)

Currently, there are hardly any open access corpora of transcribed Russian speech that can be used effectively to train speech recognition systems based on deep neural networks, such as DeepSpeech. This paper examines methods for automatically building large corpora of transcribed speech from open access sources on the internet, such as radio transcripts and video subtitles. Our study focuses on a method for building a speech corpus from materials extracted from the YouTube video hosting service. YouTube provides two types of subtitles: those uploaded by a video's author and those generated by automatic speech recognition. Each type has its own shortcomings: author subtitles may have timing inaccuracies, while automatically generated subtitles may contain recognition errors. We used the YouTube Search API to obtain links to Russian-language video clips with subtitles available, using words from a Russian dictionary as search queries. We examined two strategies for extracting audio recordings with their corresponding transcripts: using both types of subtitles, or using only those produced by automatic recognition. A voice activity detection algorithm was employed to separate the segments automatically. The study produced corpora of transcribed Russian speech containing 1000 hours of audio recordings. We also assessed the quality of the data by using part of it to train a Russian-language automatic speech recognition system based on the DeepSpeech architecture. After training, the system was tested on a data set of audio recordings of Russian literature available on voxforge.com; the best word error rate (WER) it achieved was 18%.
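The abstract reports a WER figure but does not say how it was computed. As a minimal sketch, assuming the standard definition (word-level edit distance between the reference transcript and the recognizer output, divided by the number of reference words), the metric could be calculated as follows. The function name `word_error_rate` and the whitespace tokenization are illustrative assumptions, not details taken from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference length.

    Assumption: words are separated by whitespace and compared exactly;
    a real evaluation would usually normalize case and punctuation first.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words (classic Levenshtein recurrence).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)
```

For example, `word_error_rate("мама мыла раму", "мама мыла")` counts one deletion against three reference words. Under this definition, a WER of 18% corresponds to roughly 18 word-level errors per 100 reference words.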

Anna Shaleva (Saint-Petersburg State University) G. Fedoseev (Saint-Petersburg State University) Oleg Iakushkin (Saint-Petersburg State University)

Slides

GRID_Shaleva_part2.pdf
