Dimensionaalne tekstimudel: Teoreetiline ülevaade
The dimensional text model: A theoretical overview
Author(s): Kristiina Vaik, Kairit Sirts, Kadri Muischnek
Subject(s): Applied Linguistics, Computational linguistics, ICT Information and Communications Technologies
Published by: SA Kultuurileht
Keywords: corpus linguistics; text classification; text typology; functional text dimensions; multidimensional analysis
Summary/Abstract: Corpus linguists and language technologists are increasingly turning to the Web as a source of language data. However, automatically crawled corpora have shortcomings: they contain large amounts of data, but the content is unknown. This has created a need for software that can extract the necessary information from the raw corpus. One such information extraction task in natural language processing is automatic text classification, which in practice poses several challenges, such as confusion around the terminology and the absence of a generally accepted taxonomy. Even if such a taxonomy existed, Web corpora include noisy user-generated content with a great deal of variation, which may not fit well into generally accepted taxonomies. In this article we propose a novel theoretical framework for text classification, the Dimensional Text Model (DTM). This approach does not depend on existing genres or genre taxonomies; instead it relies on text-external (function) and text-internal (linguistic features) criteria, according to which texts that express a similar set of linguistic features share a similar function. DTM combines Multidimensional Analysis (MDA, Biber 1988) and Functional Text Dimensions (FTD, Sharoff 2018). From MDA we adapt the concepts and definitions of text-internal criteria and of dimension: a dimension is a quantifiable measure of a set of co-occurring linguistic features. From FTD we adapt the notion of hybridism: instead of deciding whether or not a text belongs to a class (genre), it characterizes a text through a combination of several parameters, i.e., dimensions, and describes it in the space of dimensions.
The aim of DTM is to propose a cross-linguistic universal model that does not depend on defining or classifying genres. Instead, it offers a framework for classifying texts based on their characteristic linguistic features, describing them in a single space of dimensions, and interpreting the function of these texts based on their location in the DTM space.
Journal: Keel ja Kirjandus
- Issue Year: LXIII/2020
- Issue No: 10
- Page Range: 875-898
- Page Count: 24
- Language: Estonian
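
Illustrative sketch (not from the article): the abstract defines a dimension as a quantifiable measure of a set of co-occurring linguistic features and describes each text by its position in a space of dimensions. One way this could be pictured, assuming an MDA-style scoring in which feature rates are standardized across the corpus and summed with a sign, is shown below. The feature names, the two dimensions, their signs, and all numbers are hypothetical placeholders, not the features or dimensions proposed in DTM.

```python
from statistics import mean, pstdev

# Hypothetical feature rates per text (e.g. occurrences per 1,000 words).
corpus = {
    "forum_post": {"first_person_pronouns": 60.0, "past_tense_verbs": 20.0, "nouns": 180.0},
    "news_story": {"first_person_pronouns": 5.0, "past_tense_verbs": 70.0, "nouns": 260.0},
    "manual": {"first_person_pronouns": 1.0, "past_tense_verbs": 6.0, "nouns": 310.0},
}

# Hypothetical dimensions: each maps co-occurring features to a sign (+1 or -1).
dimensions = {
    "involved_vs_informational": {"first_person_pronouns": 1, "nouns": -1},
    "narrative": {"past_tense_verbs": 1},
}


def standardize(corpus):
    """Convert raw feature rates to z-scores computed across the corpus."""
    features = sorted({f for rates in corpus.values() for f in rates})
    stats = {}
    for f in features:
        values = [rates.get(f, 0.0) for rates in corpus.values()]
        stats[f] = (mean(values), pstdev(values))
    z_scores = {}
    for text_id, rates in corpus.items():
        z_scores[text_id] = {}
        for f in features:
            mu, sigma = stats[f]
            # A feature with no variation carries no information; score it as 0.
            z_scores[text_id][f] = (rates.get(f, 0.0) - mu) / sigma if sigma else 0.0
    return z_scores


def dimension_scores(z_scores, dimensions):
    """Place each text in the dimension space: one signed sum of z-scores per dimension."""
    return {
        text_id: {
            dim: sum(sign * z[f] for f, sign in loadings.items())
            for dim, loadings in dimensions.items()
        }
        for text_id, z in z_scores.items()
    }


scores = dimension_scores(standardize(corpus), dimensions)
for text_id, coords in scores.items():
    print(text_id, {dim: round(value, 2) for dim, value in coords.items()})
```

In this toy setup a text is not assigned to a single class. It receives one coordinate per dimension, so a text can score high on several dimensions at once, which is the kind of hybrid description the abstract attributes to FTD.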