Noun phrases in Croatian can differ in the degree of correlation between their constituents. Some constituents form descriptive free word combinations (velik stol ʽlarge table’, sunčan dan ʽsunny day’, slatka kava ʽsweet coffee’, hladne ruke ʽcold hands’), while others form multiword units which concretize extra-linguistic content that cannot be expressed in one word (crna kava ʽblack coffee’, krevet na kat ‘bunk bed’, kreditna kartica ‘credit card’, radno mjesto ‘workplace’). Dependent constituents can be adjectives, which agree with the noun (velika soba ‘big room’, radno mjesto ‘working place’), or they can be adverb phrases or prepositional phrases (korak naprijed ‘step ahead’, mnogo ljudi ‘many people’, malo prijatelja ‘a few friends’, četkica za zube ‘toothbrush’, roba s greškom ‘faulty goods’). This paper will analyze the noun mreža, which has rich syntagmatic and semantic potential, and its co-occurrences, which can form either a collocation or a free combination of words. The lexicographic description will be compared with the corpus data. The analysis will take into consideration a list of computationally obtained collocates (collocation candidates) of the node noun. Words occurring within a particular span can differ both in their frequency and in the strength of their association with the node. We will examine how the list of collocates obtained from the corpus coincides with the existing lexicographic description and with the theoretical principles for interpreting word combinations in Croatian. The aim of the study is to determine how corpus analysis can improve the treatment of word-combination entries in lexicographic work.
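As an illustration of how collocation candidates for a node noun might be obtained within a fixed span, the following is a minimal sketch and not the authors' procedure. It assumes pointwise mutual information (PMI) as the association measure, which the abstract does not name; the function name `collocates`, the parameters `span` and `min_count`, and the toy token list are all hypothetical.

```python
import math
from collections import Counter


def collocates(tokens, node, span=2, min_count=2):
    """Rank words co-occurring with `node` within +/- `span` tokens.

    Returns (word, co-occurrence frequency, PMI) tuples, strongest first.
    PMI is an assumed association measure, chosen only for illustration.
    """
    n = len(tokens)
    word_freq = Counter(tokens)
    pair_freq = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        # Count every token inside the window around this node occurrence.
        for j in range(max(0, i - span), min(n, i + span + 1)):
            if j != i:
                pair_freq[tokens[j]] += 1
    scored = []
    for word, f_pair in pair_freq.items():
        if f_pair < min_count:
            continue  # drop rare pairs, whose PMI is unreliable
        pmi = math.log2(f_pair * n / (word_freq[node] * word_freq[word]))
        scored.append((word, f_pair, pmi))
    return sorted(scored, key=lambda row: row[2], reverse=True)


# Invented toy text, not corpus data.
toy_tokens = (
    "društvena mreža je velika ribarska mreža je stara "
    "more je mirno društvena mreža raste"
).split()
candidates = collocates(toy_tokens, "mreža", span=1, min_count=2)
```

On this toy input the frequent function word je and the adjective društvena co-occur with the node equally often, but the association score ranks društvena higher, which is the kind of frequency-versus-strength difference the abstract refers to.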
This paper is a write-up of a keynote from El’Manuscript 2021, reflecting on the ways in which the field of computationally supported medieval Slavic studies has and has not changed since the mid-2000s. Looking towards developments in the broader fields of digital humanities and natural-language processing, it explores how recent improvements in the tools available for mass digitization of manuscripts and text analysis at scale open up possibilities for working with manuscripts that have received very little attention. For these advancements to be feasible, however, scholars will need to prepare and share their digitized texts and annotations in ways that are not currently the norm, though a number of projects provide exemplary models of how these new conventions could be put into practice.
The article presents some results of analyses of the Vienna part of the Codex Marianus (ÖNB, Vind. slav. 146), undertaken by an interdisciplinary group of scholars and scientists from the Centre of Image and Material Analysis in Cultural Heritage (CIMA ‒ www.cima@or.at) within two Austrian Science Fund projects devoted to the ancient Glagolitic heritage. The investigation consisted of four parts: codicological, multispectral, chemical and philological. The codicological survey served to gather as much information as possible about the writing material (source of parchment, methods of preparation, writing process, deletions, condition), while color and multispectral recordings were made to preserve the manuscript at its best and to provide an apt basis for further investigations. The chemical analysis was carried out with two portable spectroscopes (XRF and rFTIR) and aimed to obtain exact information on the parchment, the inks, paints and binders, and to collect data for a comparative study of parchment degradation. The philologists compared the fragment with all other preserved Old Church Slavonic Glagolitic manuscripts to learn as much as possible about their scribes.
The paper defines the elementary principles for creating an electronic corpus of Serbian medieval charters and letters. The commitment to the principle of maximum representativeness of the corpus, determined entirely by the preserved written legacy (based on manuscripts, microfilms or photographs), makes it unnecessary to apply the principle of balance, while simultaneously satisfying the principle of reliability, since charters and letters known solely from editions are not included in the corpus. The selection of texts follows the diplomatic criterion: transcripts and copies of documents already available in the original are excluded, as are later transcripts that are chronologically and linguistically distant from the assumed original. This approach to the selection of texts is justified by the size of the corpus, as well as by the exceptional cultural and historical significance of medieval charters and letters. The metadata for corpus texts are determined by their general diplomatic properties, as well as by the need to search the corpus for diatopic, diachronic and genre variation. Conversion of texts into electronic form strives for fidelity to the original, encompassing the preservation of abbreviations, superscript letters and original punctuation, as well as the absence of accent marks and of contemporary rules of capitalization.
The article discusses a proposal for a minimal set of criteria for sentence segmentation of medieval texts, an obligatory stage in corpus processing and annotation, especially with respect to syntactic annotation. In the context of a review of different definitions of a sentence (unit) and approaches to sentence segmentation, various criteria are discussed (structural, thematic, graphic) in order to define the minimal criteria. The discussion of the different factors is illustrated by sample sentences from two texts from the 14th and 17th centuries. The proposed criteria aim to rely mainly on structural characteristics while avoiding textual and semantic interpretation, though these can also present challenges, because the interpretation of the (syntactic) structure is inevitably related to the interpretation of the (semantic) content.
The St Petersburg Corpus of Hagiographic Texts (SCAT) has launched two new mark-up formats. The first innovation is a comprehensive format developed for the division of hagiographic texts into parts, which are both explicitly marked as section headings and extrapolated through comparison with texts of a similar genre. The second innovation is an elaborate format representing the full range of biblical, patristic and liturgical quotations occurring in the lives of saints. For the time being, three morphologically annotated manuscript texts have been marked up according to these guidelines, and we are planning to add two more texts in the near future. Close cooperation with the IHRIM research laboratory (Lyon) and wide use of their techniques and technology make it possible to obtain some illuminating cross-format statistical data and thus offer new insights into the canons and rules of Old Russian hagiography.
Digital annotation of verbal aspect in Old Russian and Church Slavonic texts is a challenging and quite complicated task that requires a complex approach. When studying Slavic aspect systems synchronically, we always know whether a verb is perfective, imperfective or biaspectual; however, this is often not the case when aspect is studied from a diachronic perspective. The aspectual status of a particular verb at earlier stages can be determined only after considering different parameters together, such as actionality, lexical semantics, morphology, functional distribution, syntactic restrictions, collocations, statistics, etc. All essential parameters should be annotated sufficiently for effective use of a corpus. That would enable a researcher to quickly collect the information necessary to build the aspectual profile of a verb. It is also important to understand the hierarchy of the parameters, as they might have different degrees of importance, and for this purpose a special algorithm should be developed. The paper discusses the preliminary results related to the parameters of annotation and the algorithm for aspect determination (using ‘Morphy’, the system for digital morphological annotation of Old Russian and Church Slavonic manuscripts developed at the Vinogradov Russian Language Institute RAS).
The article presents basic principles of designing a diachronic linguistic corpus of documents of the Don Cossack Host offices from the State Archive of the Volgograd region, Russia, including collecting documents for the text corpus, arranging the technical base for automatic processing and text editing, scheduling automated tagging, morphological annotation, and corpus software tools. The authors explain some technical aspects of corpus processing and text corpus constituency. It is considered reasonable to add any document to the corpus, including draft texts with crossed-out fragments, as this ensures accurate registration of the grammar and vocabulary of the language at a certain historical period. A set of language marker types is worked out for automated meta-tagging. The corpus software tools are defined to enable accurate annotation of obsolete fonts so that they can be processed together with regular language units and expressions in morphological and genre meta-tagging; in cases of partial text adaptation, the authentic old graphic symbols may have to be preserved.
When the scribe or author of a large collection of manuscripts is unknown or in doubt, analyzing such manuscripts can take a lot of time and effort. The more pages and potential writers are involved, the more complicated it is to get tangible results. LiViTo is a free tool that requires a minimum of experience with the command line and allows a simplified search for keywords, revisions, and clustering of historical manuscripts. We present the application of LiViTo to the “lab case” of the biographies of Czech Protestant refugees from the 18th–19th century. Most of these manuscripts include stories of farmers’ and craftsmen’s families who fled to Berlin because of their religious beliefs. This is the first time that this type of biographies and manuscripts has been examined for Czech using the methods of Digital Humanities. Using extracts from the research project in which LiViTo was developed, individual functions of the tool are explained. In addition, individual findings relating to the manuscripts and the potential further development of the tool are presented.
The article deals with various efforts of the Staatsbibliothek zu Berlin (SBB) to make the content of its collection of about 250 Church Slavonic prints from the 17th to the 19th century accessible using the methods of modern information technology from the Digital Humanities sector. The focus is on full-text indexing of the heterogeneous Church Slavonic prints using HTR+ language models from the programme Transkribus. Depending on whether they are Moscow, Kiev or Old Believer prints, these models require different approaches and corresponding adaptations that take into account the printing area and printing period. Prints such as Kirillova kniga (1644) or Gistorija Ioanna Damaskina (1637) and many others are processed at large scale, whereby the developed character recognition models are constantly refined by training on newly verified data. The full texts generated in this way are permanently stored in various XML formats (ALTO, PAGE), on the one hand in a central repository for subsequent use, and on the other hand they are merged with the original digital copies in the IIIF-compatible Digital Library of the SBB. As a further element, the Church Slavonic full texts will be indexed using special Solr analyzers for efficient searches (tokenising, transliteration, n-grams) and made searchable in subject portals (including the Slavistik-Portal) using modern text-image web design.
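To make the indexing steps named above more concrete, here is a minimal sketch of tokenising followed by character n-gram generation. It is plain Python written for illustration, not Solr's implementation and not the SBB configuration; the function names `tokenise` and `char_ngrams` and the n-gram sizes are assumptions, and the transliteration step is left out because the abstract does not specify a mapping.

```python
import re


def tokenise(text):
    """Lowercase the text and split it into word tokens (Unicode-aware)."""
    return re.findall(r"\w+", text.lower())


def char_ngrams(token, min_n=2, max_n=3):
    """Return all character n-grams of `token` for min_n <= n <= max_n.

    Indexing such fragments lets a search match partial or slightly
    misrecognised word forms. The sizes here are illustrative defaults.
    """
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(token) - n + 1):
            grams.append(token[start:start + n])
    return grams


# Hypothetical index terms for one title mentioned in the abstract.
index_terms = {
    token: char_ngrams(token) for token in tokenise("Кириллова книга")
}
```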
The paper discusses some results obtained as part of an ongoing project at the Slavic Institute of Heidelberg University to produce automatic transcriptions of an early-18th-century trilingual printed dictionary (Fedor Polikarpov’s Leksikon trejazyčnyj) and, on a preliminary basis, of a 17th-century trilingual manuscript (Epifanij Slavineckii’s working copy of his Greek–Slavic–Latin dictionary) using the handwritten text recognition (HTR) platforms Transkribus and eScriptorium. It is argued that employing such tools offers considerable advantages by simplifying and accelerating work on multilingual edition projects. Moreover, our experience of working with Transkribus and eScriptorium is compared, along with an overview of the practical benefits and challenges of working with each of these platforms.
We report on applying Handwritten Text Recognition (HTR) to manuscripts from the archive of Konstantin Rychkov preserved at IOM RAS, St. Petersburg, within the INEL project. Folklore texts in Evenki (Tungusic) were collected in Western Siberia in the 1910s. We used services provided by the Transkribus platform. The necessary step of Layout Analysis proved to be time-consuming because the parallel Evenki-Russian text is arranged on the page without a strict separation line. HTR models have been trained successively on different amounts of data, up to 521 pages. The best Character Error Rate (CER) attained on validation data for the largest dataset is 4.50% for models trained on all characters. The distribution of errors is non-uniform: most errors are due to just a few problematic issues, especially diacritics such as the accent marking stress. It is written high above the line and is frequently cut off from the line images at the preprocessing stage. After excluding the stress mark from training data and recognition, the lowest CER dropped to 2.90%. We compared two recognition engines, HTR+ and PyLaia. The HTR+ model trained without stress marks made fewer errors in letters, while PyLaia performed better with respect to diacritics.
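For readers unfamiliar with the metric, the sketch below shows one common way to compute a Character Error Rate: the Levenshtein edit distance between a reference transcription and the recognised text, divided by the reference length. It is an illustration only and not the Transkribus implementation. The helper `strip_stress` assumes the stress mark is encoded as the combining acute accent U+0301, which the abstract does not state; all function names are hypothetical.

```python
import unicodedata


def edit_distance(ref, hyp):
    """Levenshtein distance between two strings (unit costs)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,         # deletion
                            curr[j - 1] + 1,     # insertion
                            prev[j - 1] + cost)) # substitution or match
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character Error Rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference transcription must not be empty")
    return edit_distance(ref, hyp) / len(ref)


def strip_stress(text, mark="\u0301"):
    """Remove an assumed combining stress mark, keeping other diacritics."""
    decomposed = unicodedata.normalize("NFD", text)
    return unicodedata.normalize("NFC", decomposed.replace(mark, ""))
```

Applying `strip_stress` to both the reference and the recognised text before calling `cer` corresponds to evaluating without the stress mark, so errors on that single diacritic no longer count against the model.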
The author compares the marginal glosses in Epifanij Slavinetskij’s Sbornik perevodov (1665) with the text of Athanasius’ Third Oration against the Arians in Gavrilo Venclović’s Razglagolnik (1734). The marginal glosses in Epifanij’s Russian version are taken from a South Slavonic manuscript that has a common origin with the protograph of Venclović. The Orationes contra Arianos in Razglagolnik are written in South Slavonic koine, and their source has the features of an Athonite translation related to the Council of Ferrara-Florence and the disputes over the filioque.
The text transmission of the Slavonic translation of Hippolytus’ De Christo et Antichristo presents a stable and well-attested tradition. It provides a basis for a possible reconstruction of the Greek original from which this translation was made. The article demonstrates some omissions, additions, and reconstructions in the Greek text compared to the Slavonic one. The paper also addresses significant problems that occur in scholars’ work on bilingual dictionaries, discussing possible approaches and solutions. Still, some questions remain, and it is not easy to suggest a definite answer to them. The author underlines the importance of the fragmentary copy of the Greek text presented in the manuscript Meteora 573, bearing in mind its significant correspondence to the Slavonic tradition. Unfortunately, this manuscript preserves only small fragments of the whole work by Hippolytus of Rome.
The article focuses on Old Slavonic versions of Euthalian chapter-lists to Acts and Epistles, considering meta-communicative terms such as παραίνεσις or προοίμιον. The author aims to evaluate the level of accuracy of the Slavonic translations and their exegetical potential, which makes the content of the main text of Acts and Epistles clear. The analysis reveals two tendencies prevailing in Slavonic sources from the 12th–16th centuries. First, there is lexical variability resulting from the application of various translation strategies (calques, periphrastic constructions, and text expansion), which are more or less successful in terms of the accuracy and clarity of the resulting text. Second, there is a tendency towards unification, suggesting a universal Slavonic term for several Greek correlates. Authoritative dictionaries, including academic ones, do not record some lexemes. The lexicon of the chapter-lists does not depend on the vocabulary of the main text.
The focus of this report is the still-unexplored Interpretation of Orthodox liturgy, attested in two copies: the first in manuscript No. 88 from the collection of Obolensky (201), State Archive of the Russian Federation (Moscow), the second in manuscript No. 52 of 1567, from the Archive of Baltazar Bogišić in Cavtat. The two manuscripts contain proven original works of Constantine of Kostenets (1380–1431). The author analyzes the structure and content of the interpretation and comments on it as a source for the history of the Liturgy, from the point of view of the data concerning the liturgical features described in it. It can be concluded that the texts in MS No. 88 and MS Bogišić 52 are based on a late composition of Byzantine mystagogy, which, in turn, means that the South Slavic translation should be dated no earlier than the end of the 12th century. This is one of the many short epitomes created during the Second Bulgarian Kingdom as a result of the secondary reduction of the original extensive commentary. A detailed investigation and the text-critical edition will be forthcoming.
The concept of stop words, introduced by H. P. Luhn in the mid-20th century, plays a huge role in today’s NLP practice. Stop words are used to reduce noisy text data, remove uninformative words, speed up text processing, and minimize the amount of memory required to store data. Kyrgyz is an agglutinative Turkic language for which no scientific study of stop words has previously been published in English. In our study, we combined frequency analysis with rule-based linguistic analysis. First, we counted word frequencies, set a threshold, and removed words below the threshold, which gave us a list of the most frequently used words. Then we reduced the list by excluding all words that do not belong to the category of function words of the Kyrgyz language. Finally, we obtained a list of 50 words that can be considered stop words in the Kyrgyz language. In our analysis, we used a single corpus of sentences collected and posted as an open-source project by one of the local broadcasters.
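The two-stage procedure described above (a frequency threshold followed by a function-word filter) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function name `stop_word_candidates`, the whitespace tokenisation, the threshold value, and the English toy sentences and function-word set are all hypothetical stand-ins for the Kyrgyz corpus and the rule-based linguistic analysis.

```python
from collections import Counter


def stop_word_candidates(sentences, min_freq, function_words):
    """Return frequent words that are also function words, most frequent first.

    Stage 1 keeps words whose corpus frequency is at least `min_freq`.
    Stage 2 keeps only those found in `function_words`, a stand-in for
    the rule-based linguistic filter.
    """
    counts = Counter(
        word for sentence in sentences for word in sentence.lower().split()
    )
    frequent = [word for word, count in counts.most_common() if count >= min_freq]
    return [word for word in frequent if word in function_words]


# Invented English toy data, used only to show the two stages.
toy_sentences = ["the cat and the dog", "the bird and the fish", "a cat sat"]
toy_function_words = {"the", "and", "a"}
toy_stop_words = stop_word_candidates(toy_sentences, 2, toy_function_words)
```

In the toy data, "cat" passes the frequency threshold but is dropped by the second stage because it is a content word, while "a" is a function word that is dropped by the first stage because it is too rare.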
The study introduces OnomOs, a new corpus of Czech texts with annotation of proper names. The corpus was compiled by onomasticians from the Department of Czech Language, Faculty of Arts, University of Ostrava, and made available by the Institute of the Czech National Corpus, Faculty of Arts, Charles University in Prague. The paper briefly discusses the content and structure of the corpus, the selection of texts for inclusion, and the onomastic-geographical classification of the identified names. The text consists chiefly of three preparatory analyses, which focus on the most frequent surnames, collocations found in Western and Eastern countries in the pre-1989 period, and the declension patterns of three types of onyms. In the summary, further possibilities of onomastic corpus research are presented.
Cybersecurity is a rapidly developing domain, where emerging new concepts are usually first designated in English and then find their way into the usage of other languages. As the Lithuanian terminology in this domain develops, different types of synonymous terms appear in usage, which are treated differently by speakers. The article presents a terminology survey involving 593 respondents from various age groups, regions and expertise levels. In the survey, the respondents had to name the most suitable terms for 10 cybersecurity concepts: they could choose the terms proposed in the questionnaire or propose their own terms, and give the reasons for their choices. The concepts and their terminological designations were selected from the Lithuanian-English Cybersecurity Termbase, the dataset of which is based on bilingual parallel and comparable cybersecurity corpora. The quantitative and qualitative analysis of the survey results reveals preferences for different types of terms, such as borrowings, metaphorical calques, and descriptive terms, and how these preferences differ across two pairs of respondent groups: students vs. graduates, and cybersecurity experts vs. the general public. The results show that some terminological designations have already been established in the Lithuanian language, while most of them are still competing for their positions. The analysis of the reasons reveals that accuracy and clarity are the main factors for selecting a term. The research contributes to the standardisation of cybersecurity terms in Lithuania and provides insights into user preferences and the reasons behind them.