SPEECH AND TEXT DATA PREPARATION FOR DEVELOPING OF AN AUTOMATIC SPEECH RECOGNITION SYSTEM FOR THE KARELIAN LANGUAGE

ПОДГОТОВКА РЕЧЕВЫХ И ТЕКСТОВЫХ ДАННЫХ ДЛЯ СОЗДАНИЯ СИСТЕМЫ АВТОМАТИЧЕСКОГО РАСПОЗНАВАНИЯ КАРЕЛЬСКОЙ РЕЧИ

Author(s): Irina S. Kipyatkova, Alexandra P. Rodionova, Ildar A. Kagirov, Andrey A. Krizhanovsky
Subject(s): Language and Literature Studies, Pragmatics, Finno-Ugrian studies
Published by: Petrozavodsk State University (Петрозаводский государственный университет)
Keywords: Karelian language; Livvi-Karelian dialect; automatic natural language processing; speech recognition system training; datasets; corpus linguistics

Summary/Abstract: This paper addresses the collection and preparation of language data for the Livvi dialect of the Karelian language, needed to train an automatic speech-to-text system. Such technologies matter for Karelian because it is a low-resource language, which is a serious obstacle to its study and preservation. The main tasks at the current stage of the research are to collect and annotate speech and text corpora and to create a transcription dictionary. The speech corpus includes audio recordings of 15 speakers (6 men and 9 women). All the recordings were transcribed and segmented into single utterances. After the removal of “junk” fragments, the recordings totaled 3.5 hours. After the removal of repeated sentences, the text corpus contained over 5 million word tokens. Based on the collected text corpus, a dictionary was created, which will subsequently be used as part of the Karelian speech recognition system. All the words in the dictionary were automatically given a phonemic transcription. In further research, the collected text and speech data will be used for training and testing the Livvi-Karelian speech recognition system.

  • Issue Year: 45/2023
  • Issue No: 5
  • Page Range: 89-98
  • Page Count: 10
  • Language: Russian