WYKORZYSTANIE KORPUSÓW ROSYJSKOJĘZYCZNYCH NEWSÓW INTERNETOWYCH NA POTRZEBY SYSTEMÓW AUTOMATYCZNEGO ROZPOZNAWANIA MOWY W OBSZARZE MONITORINGU MEDIÓW

Daniel Borysowski

THE USE OF RUSSIAN-LANGUAGE INTERNET NEWS CORPORA FOR THE PURPOSES OF AUTOMATIC SPEECH RECOGNITION SYSTEMS IN THE AREA OF THE MEDIA MONITORING

Author(s): Daniel Borysowski
Subject(s): Media studies, Eastern Slavic Languages, ICT Information and Communications Technologies
Published by: Polskie Towarzystwo Rusycystyczne
Keywords: internet news corpora; language modeling; speech recognition; ASR; media monitoring

Summary/Abstract: The author used open Internet news corpora (NewsRu and Taiga) to build N-gram language models for automatic speech recognition systems. The models were evaluated comprehensively: perplexity, WER, proper name recognition, and comparison with the base model and Google ASR. The N-gram models were also rescored using recurrent neural networks. The effectiveness of the models was assessed by recognizing speech from the news channel Россия 24 (37 files with a total length of 1.5 hours were tested). The choice of test data follows from the main goal of the article: speech recognition for so-called media monitoring.
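
For readers unfamiliar with the WER metric named in the abstract, the following is a minimal sketch of how word error rate is commonly computed: the word-level edit distance between a reference transcript and a recognizer hypothesis, divided by the number of reference words. This is not the author's evaluation code. The function name `word_error_rate` and the sample sentences are assumptions made for illustration only.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one substituted word out of four reference words.
wer = word_error_rate("в москве прошла встреча", "в москве прошло встреча")
print(f"WER = {wer:.2f}")
```

Because insertions count as errors, the value can exceed 1.0 when the hypothesis is much longer than the reference.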

Journal: Przegląd Rusycystyczny

Issue Year: 2022
Issue No: 177
Page Range: 32-54
Page Count: 23
Language: Polish
