Meetodeid tekstide leksikaalsete ja grammatiliste erinevuste tuvastamiseks meditsiiniliste tarbetekstide näitel
METHODS FOR IDENTIFYING LEXICAL AND GRAMMATICAL DIFFERENCES IN MEDICAL APPLIED TEXTS

Author(s): Raul Sirel
Subject(s): Language and Literature Studies
Published by: Eesti Rakenduslingvistika Ühing (ERÜ)
Keywords: corpus linguistics; text linguistics; text corpora; genre analysis; language technology

Summary/Abstract: This paper introduces transparent statistical methods for identifying the distinctive characteristics of patient information leaflets and specifications for human medicines. Although both document types have been published by the Estonian State Agency of Medicines and have been digitally available for some time, they have neither been linguistically analysed nor used in the development of language technology applications. It is generally accepted that improving the quality of language technology applications often requires genre-specific approaches, because a model trained on one genre commonly does not produce equally good results when applied to another. The aim of the present paper is to identify the linguistic features that differentiate patient information leaflets and specifications for human medicines from each other and from the language represented in the Balanced Corpus of Estonian. To this end, two text corpora containing the texts of 3977 patient information leaflets and 3977 specifications for human medicines were created and statistically compared with each other and with the Balanced Corpus of Estonian. The comparison revealed that patient information leaflets and specifications for human medicines contain a relatively limited lexicon compared with the Balanced Corpus. This finding is relevant because restricted lexicons tend to facilitate tasks such as information mining and automatic summarisation. Furthermore, the language of patient information leaflets appeared to be somewhat more similar to the language represented in the Balanced Corpus than the language of the specifications was. The collected corpora of patient information leaflets and specifications for human medicines are valuable resources and merit further research.

  • Issue Year: 2013
  • Issue No: 9
  • Page Range: 265-278
  • Page Count: 14
  • Language: Estonian