Terra Linguistica

EDI RUS 7-10 0000-0002-6425-2050 National Research University Higher School of Economics Kolmogorova Anastasia nastiakol@mail.ru

St. Petersburg, Russian Federation

Engineering linguistic technologies in text studies The publication is devoted to the analysis of the current state of engineering linguistics, its main directions and research challenges. The definition of language technologies and their typology are formulated according to the criterion of the tasks solved with their help. It is noted that the national school of engineering linguistics manages to maintain a balance between technological and linguistic research. 10.18721/JHSS.14101 81'32, 81'33 language technologies engineering linguistics computational linguistics language models https://human.spbstu.ru/article/2023.51.1/ 7-10.pdf

RAR RUS 11-20 0000-0001-7580-4386 Smolensk State University Andreev Vadim vadim.andreev@ymail.com

Smolensk, Russian Federation

Evolution of Vladimir Nabokov’s image system: quantitative analysis The article deals with the evolution of an important aspect of Vladimir Nabokov’s individual style, namely his image system. Nabokov is much better known as a prose writer. However, he began his creative career as a poet and continued to write verse all his life. The traditional approach adopted here, which defines an image as a transfer of conceptual characteristics, makes it possible to carry out a quantitative analysis of the concepts used by the author and to compare their frequencies in the early and mature periods of his creative activity. Multivariate discriminant analysis is used as a statistical method to differentiate between the periods on the basis of a large number of variables considered simultaneously (frequencies of concepts in the function of source domain). The results demonstrate that there are significant changes in the individual style of the poet and, consequently, in his worldview. The resulting discriminant model, which includes eight characteristics (concepts marking style alteration), makes it possible to automatically attribute a text to the right period in 100% of cases. Qualitative analysis of changes in the frequencies of concepts reveals a complex polyphonic character of style alteration, which includes both a transition from complicated to simpler concepts and a change to a more complex understanding of living beings. 10.18721/JHSS.14102 811'32 stylochronometry individual style image system discriminant analysis Nabokov https://human.spbstu.ru/article/2023.51.2/ 11-20.pdf

RAR RUS 21-29 0000-0003-2856-5049 St. Petersburg State University Grebennikov Alexander agrebennikov@mail.ru

St. Petersburg, Russian Federation

0000-0002-3347-1373 St. Petersburg State University Marusenko Natalya n.marusenko@spbu.ru

St. Petersburg, Russian Federation

0000-0002-7825-1120 St. Petersburg State University Skrebtsova Tatyana t.skrebtsova@spbu.ru

St. Petersburg, Russian Federation

Mapping word frequencies in fiction on sociopolitical context: the case of early 20th century Russian short stories The paper deals with the language of Russian short stories written in the period 1900–1930. It is based on the Russian Short Stories Corpus, an ongoing research project that aims to collect, digitally process, and present the Russian literature of the early 20th century in electronic form. The Corpus contains stories written by thousands of Russian authors, both well-known and almost forgotten. From the Corpus, a sample was taken to serve as a testbed for linguists, lexicographers and literary scholars, enabling them to check their intuitions concerning the language and style of the epoch. The sample has been divided into three subsamples along the lines set by the dramatic turns of Russian history. The first subsample contains the stories produced from the onset of the 20th century up to WWI (1900–1913), the second covers the tumultuous period of wars and revolutions (1914–1922), and the third comprises the stories written in the Soviet Union (1923–1930). The Corpus has proved instrumental in detecting manifold changes in language use, including grammar, vocabulary, syntactic patterns, collocations, and stylistics. In the present paper, frequency-sorted word lists are used to bring out relevant changes in Russian vocabulary, linking them to the sociopolitical context. The results obtained will provide valuable data for lexicographers compiling Russian dictionaries of the above-mentioned period. 10.18721/JHSS.14103 81'33 Russian short stories text corpus frequency dictionary Russian lexicography stylometry https://human.spbstu.ru/article/2023.51.3/

RAR RUS 30-40 0000-0001-5338-3656 Peter the Great St. Petersburg Polytechnic University Evtushenko Tatiana evtushenkotg@gmail.com

St. Petersburg, Russian Federation

6523-2016 57189038663 0000-0002-6326-8392 St. Petersburg Electrotechnical University “LETI” Klochkova Yelena esklochkova@etu.ru

St. Petersburg, Russian Federation

National Research Tomsk State University Laputenko Andrey laputenko.av@gmail.com

Tomsk, Russian Federation

0000-0002-4006-1161 Institute for System Programming of the Russian Academy of Sciences Evtushenko Nina evtushenko@ispras.ru

Moscow, Russian Federation

Studying the impact of morphological parameters on text readability using statistical analysis methods The paper addresses one of the important aspects of text complexity, namely the dependency of text readability on a set of morphological and text surface metrics such as the average length of words, sentences, etc. The correlation between objective text complexity, which is specified by quantitative parameters of linguistic features, and subjective text complexity, i.e. the difficulty of text comprehension as a psychological phenomenon, is analyzed. To assess morphological text complexity, we used an annotated dataset consisting of 1000 online news texts (140000 tokens) retrieved from the websites of Russian universities. For each text unit, the ratio of each part of speech per token is measured. The online news texts of the dataset were also assessed by the target audience of the websites, i.e. applicants, undergraduate and postgraduate students. As a result, the dataset was automatically annotated based on linguistic features of the texts and human-labelled based on experts’ estimates of text readability on a 5-point scale. To assess the significance of morphological metrics and their influence on text readability, correlation and regression analyses were carried out. To automatically classify a text as ‘easy-to-read’ or not ‘easy-to-read’, both single-feature models and compound models including more than one metric were constructed. In agreement with prior research, the most common metrics influencing text readability appear to be text surface characteristics. However, the proposed models also made it possible to establish the significance of morphological parameters, used both in single-feature and compound models, such as the use of participles, nouns in the genitive case, adjectives and numerals, which should be taken into account in analyzing news text readability. Moreover, novel formulae for assessing readability were proposed based on the studied coefficients. 10.18721/JHSS.14104 81.32 text complexity readability morphological features media text correlation and regression analysis https://human.spbstu.ru/article/2023.51.4/ 30-40.pdf

RAR RUS 41-56 Herzen State Pedagogical University of Russia Kamshilova Olga onkamshilova@gmail.com

St. Petersburg, Russian Federation

Herzen State Pedagogical University of Russia Belyaeva Larisa belyaevaln@herzen.spb.ru

St. Petersburg, Russian Federation

Machine translation in the age of digitalization: new practices, procedures and resources In Russian linguistics, digitalization is traditionally associated with the use of mathematical and computer methods applied mainly to text processing problems in various automated systems. The article analyzes the impact of digitalization on the use and purpose of machine translation (MT) systems in modern conditions. It describes new practices of using MT products both by professional translators and by general MT system users for their individual purposes. It highlights the objective advantages and disadvantages of MT application from the point of view of both practicing professional translators and ordinary users. The article considers a translator’s new working conditions and the new roles and skills determined by the impact of digitalization on working with text. It pays special attention to post-editing MT products as a translator’s new professional activity, which is needed to ensure high-quality translation and to extract correct information. It also describes the necessary and sufficient post-editing procedures to be performed by non-professional users while pursuing their own goals through MT application. Finally, the research focuses on the analysis of procedures and available linguistic resources that can optimize working with MT systems. 10.18721/JHSS.14105 8'33 digitalization machine translation (MT) MT systems post-editing translation practices linguistic resources https://human.spbstu.ru/article/2023.51.5/ 41-56.pdf

RAR RUS 57-69 M-9533-2013 56088078800 0000-0001-9085-0284 St. Petersburg State University Khokhlova Maria m.khokhlova@spbu.ru

St. Petersburg, Russian Federation

Learner corpora: relevant information and an overview of the existing frameworks In the modern world, there is a constant interest in foreign languages. Therefore, studying the language used by non-native speakers, as well as describing their mistakes, is a highly relevant matter. Learner corpora differ not only in the languages they focus on, but also in a number of their properties. The purpose of the study is to present a review of the learner corpora available for different languages, as well as to compare the approaches that exist for their annotation. The paper considers the origins of learner corpus research, focuses on the main stages of a project, the types of learner corpora (which may differ in their tasks, students’ mother tongue, language proficiency, text genre, data type, etc.), and the linguistic and metatextual information that accompanies texts, and provides a classification of errors. The paper gives a brief overview of annotation tools and corpus platforms that can be used for building a learner corpus. 10.18721/JHSS.14106 81'32 learner corpora typology errors annotation second language acquisition https://human.spbstu.ru/article/2023.51.6/

RAR RUS 70-87 0000-0002-3008-5514 St. Petersburg State University Mitrofanova Olga A. o.mitrofanova@spbu.ru

St. Petersburg, Russian Federation

St. Petersburg State University Athugodage Mark m.athugodage@yahoo.com

St. Petersburg, Russian Federation

Dynamic topic modelling of the Russian legal text corpus The article is devoted to the dynamic topic modelling analysis of legislative acts, decrees of senior officials and resolutions of the Supreme and Constitutional Courts dated 2008–2022, included in the research corpus of Russian legal documents. The article describes the procedures of corpus construction and preprocessing, and the training of topic models on this corpus. We consider both a standard topic model and a dynamic topic model that takes into account changes in topics over time. After training the models in various conditions, a set of optimal training parameters was determined. The BERTopic library was used as the main tool for topic modelling, combining algorithms for constructing topic models and contextualized neural network models of distributed vectors. The research data may be of interest both for specialists in the field of computational linguistics and for sociologists, political scientists, and lawyers working with legislative documents. 10.18721/JHSS.14107 81'32, 81'33 topic modelling dynamic topic model BERTopic Russian corpus of legal documents Russian gazette https://human.spbstu.ru/article/2023.51.7/ 70-87.pdf

RAR RUS 88-97 0000-0002-8815-7920 Petrozavodsk State University Rogov Alexander

Petrozavodsk, Russian Federation

0000-0001-5556-5349 Petrozavodsk State University Moskin Nikolai moskin@petrsu.ru

Petrozavodsk, Russian Federation

0000-0001-9939-9389 Petrozavodsk State University Lebedev Alexander perevodchik88@yandex.ru

Petrozavodsk, Russian Federation

On the paradigm shift of the author's invariant One of the urgent tasks in philology is the attribution of texts. The quantitative indicator by which one can distinguish between the works of different authors is called the author's invariant. The paper describes a number of studies (the method of G. Kjetsaa, the method of evaluating the pair connection of grammatical classes, the method of "decision trees", the Delta method), the results of which confirm that the initial definition of the author's invariant should be corrected. In particular, this applies to the time interval during which the attribution parameter should keep a "constant value". It does not necessarily coincide with the entire period of the writer's work. Also, due to the lack of a universal criterion that uniquely distinguishes a particular writer from others, one should use a set of characteristics of author's invariants at different levels of the language. The performed analysis shows that the term "author's invariant" should be divided into two categories – "global author's invariant" and "local author's invariant" – which can be studied independently of each other. 10.18721/JHSS.14108 808.1 literary text linguostatistical parameter author's invariant classification data mining https://human.spbstu.ru/article/2023.51.8/ 88-97.pdf

CNF RUS 98-107 Herzen State Pedagogical University of Russia Kamshilova Olga onkamshilova@gmail.com

St. Petersburg, Russian Federation

Herzen State Pedagogical University of Russia Belyaeva Larisa belyaevaln@herzen.spb.ru

St. Petersburg, Russian Federation

J-2590-2015 57207357482 Herzen State Pedagogical University of Russia Piotrowska Xenia krp62@mail.ru

St. Petersburg, Russian Federation

Language engineering and applied linguistics today: The chronicle of the IV International conference “R. Piotrowski’s Readings – 2022” This chronicle provides an overview of the IV International Conference on Language Engineering and Applied Linguistics “R. Piotrowski’s Readings – 2022” held on November 22, 2022, at Herzen State University (St. Petersburg, Russia). The conference was organized to mark the 100th anniversary of the birth of Rajmund G. Piotrowski (1922–2009), a Russian scientist, professor, and Honored Scientist of Russia. R.G. Piotrowski was the founder of the Language Engineering School, a pioneer of MT in Russia, and the initiator of the engineering-linguistic strategy in research and practical methodological work and of the evidence-based paradigm in the methodology of humanities research. The article presents a brief outline of R.G. Piotrowski’s scientific legacy. It focuses on various methodological approaches and research practices in the field of engineering and applied linguistics contributed by the conference participants. 10.18721/JHSS.14109 8'33 R.G. Piotrowski engineering linguistics applied linguistics corpus linguistics machine translation language training computer systems https://human.spbstu.ru/article/2023.51.9/ 98-107.pdf