Web crawling for linguistic purposes. Selected aspects of collecting and analyzing text data on the example of Russian-language Internet news

Web crawling dla celów lingwistycznych. Wybrane aspekty gromadzenia i analizy danych tekstowych na przykładzie rosyjskojęzycznych newsów internetowych
Web crawling for linguistic purposes. Selected aspects of collecting and analyzing text data on the example of Russian-language Internet news

Author(s): Daniel Borysowski
Subject(s): Language and Literature Studies
Published by: Wydawnictwo Uniwersytetu Warmińsko-Mazurskiego w Olsztynie
Keywords: web crawling; corpus of text files; Internet news; text delimiter; quote; re-product; multi-word expressions

Summary/Abstract: The author of the article collected nearly 2.7 million excerpts of Russian-language Internet news. The main objectives of the article include: discussing the concept of web crawling in relation to the acquisition of online text data, addressing issues related to structuring such data in unannotated text corpora, as well as presenting selected aspects of analyzing data structured this way. The author considers Internet news to be a combination of the main text and metadata that identifies and characterizes it (acquired during automatic extraction from websites). The categorization of news into the main text and metadata creates an opportunity to analyze it from two perspectives – textual and meta-information (and an additional perspective that combines these two, for example for the purpose of chronological studies). An outline of possible linguistic research into the collected material is supplemented with evaluating selected multi-word tokens extracted from these texts based on the delimitation function of quotation marks.
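
As a rough illustration of the last point, the sketch below pairs a news item's main text with its metadata and pulls out multi-word spans delimited by quotation marks. It is not the author's implementation: the `NewsItem` record, the `quoted_multiword` helper, the sample sentence, and the metadata values are all hypothetical, and the pattern assumes that only guillemets («…») and straight double quotes act as delimiters.

```python
import re
from dataclasses import dataclass, field
from typing import List


# Hypothetical record layout: main text plus the metadata collected
# during automatic extraction from a website.
@dataclass
class NewsItem:
    text: str
    metadata: dict = field(default_factory=dict)


# Assumption: quoted spans use guillemets or straight double quotes.
QUOTED = re.compile(r'«([^«»]+)»|"([^"]+)"')


def quoted_multiword(item: NewsItem) -> List[str]:
    """Return quoted spans containing at least two whitespace-separated words."""
    spans = []
    for match in QUOTED.finditer(item.text):
        span = (match.group(1) or match.group(2)).strip()
        if len(span.split()) >= 2:
            spans.append(span)
    return spans


# Invented sample item; the metadata keys are placeholders.
item = NewsItem(
    text='Газета «Новая газета» сообщила, что «Спартак» выиграл матч.',
    metadata={"source": "example.org", "date": "2021-01-01"},
)
print(quoted_multiword(item))  # ['Новая газета']
```

Keeping the metadata next to the text means the same extracted spans could later be grouped by date or source, which is the kind of combined textual and meta-information view the abstract mentions.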

  • Issue Year: 23/2021
  • Issue No: 3
  • Page Range: 87-104
  • Page Count: 17
  • Language: Polish