Web crawling dla celów lingwistycznych. Wybrane aspekty gromadzenia i analizy danych tekstowych na przykładzie rosyjskojęzycznych newsów internetowych

Daniel Borysowski

Web crawling dla celów lingwistycznych. Wybrane aspekty gromadzenia i analizy danych tekstowych na przykładzie rosyjskojęzycznych newsów internetowych
Web crawling for linguistic purposes. Selected aspects of collecting and analyzing text data on the example of Russian-language Internet news

Author(s): Daniel Borysowski
Subject(s): Language and Literature Studies
Published by: Wydawnictwo Uniwersytetu Warmińsko-Mazurskiego w Olsztynie
Keywords: web crawling; corpus of text files; Internet news; text delimiter; quote; re-product; multi-word expressions

Summary/Abstract: The author of the article collected nearly 2.7 million excerpts of Russian-language Internet news. The main objectives of the article include: discussing the concept of web crawling in relation to the acquisition of online text data, addressing issues related to structuring such data in unannotated text corpora, as well as presenting selected aspects of analyzing data structured this way. The author considers Internet news to be a combination of the main text and metadata that identifies and characterizes it (acquired during automatic extraction from websites). The categorization of news into the main text and metadata creates an opportunity to analyze it from two perspectives: textual and meta-information (and an additional perspective that combines these two, for example for the purpose of chronological studies). An outline of possible linguistic research into the collected material is supplemented with evaluating selected multi-word tokens extracted from these texts based on the delimitation function of quotation marks.


Journal: Prace Językoznawcze

Issue Year: 23/2021
Issue No: 3
Page Range: 87-104
Page Count: 17
Language: Polish

