Corpus compilation for digital humanities in lower–resourced languages: A practical look at compiling thematic digital media corpora in Serbian, Croatian and Slovenian

Author(s): Ksenija D. Bogetić, Vuk Batanović, Nikola Ljubešić
Subject(s): Media studies, Lexis, South Slavic Languages
Published by: Hrvatsko filološko društvo
Keywords: corpus linguistics; corpus compilation; corpora and discourse analysis; digital media

Summary/Abstract: The digital era has unlocked unprecedented possibilities for compiling corpora of social discourse, which has brought corpus linguistic methods into closer interaction with other methods of discourse analysis and the humanities. Even researchers who use no specific corpus linguistic techniques increasingly draw on some sort of corpus for empirically–grounded social–scientific analysis (sometimes dubbed ‘corpus–assisted discourse analysis’ or ‘corpus–based critical discourse analysis’; cf. Hardt–Mautner 1995; Baker 2016). In the post–Yugoslav space, recent corpus developments have brought transformative advantages in many areas of discourse research, along with an ongoing proliferation of corpora and tools. Still, for linguists and discourse analysts who embark on collecting specialized corpora for their own research purposes, many questions persist, partly because the background to these issues changes fast, but also because gaps remain in corpus methods, and in guidelines for corpus compilation, as applied beyond anglophone contexts. In this paper we discuss some possible solutions to these difficulties by presenting a step–by–step account of a corpus building procedure specifically for Croatian, Serbian and Slovenian, through an example of compiling a thematic corpus from digital media sources (news articles and reader comments). Following an overview of corpus types, uses and advantages in social sciences and digital humanities, we present the corpus compilation possibilities in the South Slavic language contexts, including data scraping options, permissions and ethical issues, the factors that facilitate or complicate automated collection, and corpus annotation and processing possibilities. The study shows expanding possibilities for work with the given languages, but also some persistent grey areas where researchers need to make decisions based on research expectations. Overall, the paper aims to recapitulate our own corpus compilation experience in the wider context of South–Slavic corpus linguistics and corpus linguistic approaches in the humanities more generally.

  • Issue Year: 48/2022
  • Issue No: 94
  • Page Range: 129-152
  • Page Count: 24
  • Language: English