WYKORZYSTANIE KORPUSÓW ROSYJSKOJĘZYCZNYCH NEWSÓW INTERNETOWYCH NA POTRZEBY SYSTEMÓW AUTOMATYCZNEGO ROZPOZNAWANIA MOWY W OBSZARZE MONITORINGU MEDIÓW
THE USE OF RUSSIAN-LANGUAGE INTERNET NEWS CORPORA FOR THE PURPOSES OF AUTOMATIC SPEECH RECOGNITION SYSTEMS IN THE AREA OF THE MEDIA MONITORING

Author(s): Daniel Borysowski
Subject(s): Media studies, Eastern Slavic Languages, ICT Information and Communications Technologies
Published by: Polskie Towarzystwo Rusycystyczne
Keywords: internet news corpora; language modeling; speech recognition; ASR; media monitoring

Summary/Abstract: The author used open Internet news corpora (NewsRu and Taiga) to build N-gram language models for automatic speech recognition systems. The models were evaluated comprehensively (perplexity, WER, proper-name recognition, and comparison with the base model and Google ASR). The author also rescored the N-gram models using recurrent neural networks. The effectiveness of the models was assessed by recognizing speech from the news channel Россия 24 (37 files with a total length of 1.5 hours were tested). The test data were chosen to match the main goal of the article: speech recognition for so-called media monitoring.
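
For readers unfamiliar with WER (word error rate), one of the metrics named in the abstract, the following is a minimal sketch of how it is commonly computed: the word-level edit distance between a reference transcript and the recognizer's hypothesis, divided by the number of reference words. This is not the author's evaluation code; the function name `word_error_rate` and the sample strings are hypothetical and chosen only for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)


# Hypothetical example: one substituted word and one dropped word
# against a four-word reference.
score = word_error_rate("a b c d", "a x c")
```

Under this definition WER can exceed 1.0 when the hypothesis contains many inserted words, which is why it is usually reported alongside other measures such as perplexity.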

  • Issue Year: 2022
  • Issue No: 177
  • Page Range: 32-54
  • Page Count: 23
  • Language: Polish