Keyphrases in Russian-language popular science texts: comparison of oral and written speech perception with the results of automatic analysis
Information can be transmitted through both oral and written speech. The mechanisms of perceiving written and spoken texts manifest themselves at different levels of communication and text comprehension, including the level of keyphrases. Keyphrases convey the essential information of a text in compressed form and support text structuring, classification, and rapid assessment of content. The aim of this study is to analyze the differences that arise in the perception of the same text presented in written and oral form. To this end, we examined written and oral texts in Russian, extracting keyphrases both manually and automatically. This approach was chosen to identify algorithms that approximate the mechanisms native speakers use when selecting keyphrases. Experiments were performed on a dataset of transcripts and audio recordings of lectures given by Russian-speaking participants of the “Postnauka” project. The following approaches were used for automatic keyphrase extraction from written texts: statistical measures (Log-Likelihood, T-test, PMI, Chi-square), hybrid linguostatistical methods (RAKE, RuTermExtract, SpaCy), a machine learning-based method (KeyBERT), and ChatGPT. Manual annotation was obtained through perceptual experiments with Russian-speaking participants. In addition, the distribution of keyphrases across the text structure was analyzed. Comparing the results of automatic processing with those of the perceptual experiments shows a low level of agreement between the extracted keyphrases.
The study investigated the capabilities of various automatic keyphrase extraction algorithms, as well as their limitations when applied to written and oral texts. Our observations suggest that developing effective techniques for selecting keyphrases requires taking into account the typological features of the natural languages represented in the analyzed texts, the subject areas of the texts, and the availability of appropriate linguistic and software tools. There is also evidence that the choice of a keyphrase extraction method should be based not only on criteria related to the frequency and stability of keyphrases, but also on how they are perceived.
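As an illustration of one of the statistical measures named above, the following minimal sketch scores adjacent word pairs by pointwise mutual information (PMI) and then compares a set of automatically selected candidates with a manually selected set. It is not the pipeline used in the study: the function names `pmi_bigrams` and `jaccard`, the `min_count` threshold, whitespace tokenization, and the use of Jaccard overlap as an agreement measure are all assumptions made for this example.

```python
import math
from collections import Counter


def pmi_bigrams(tokens, min_count=2):
    """Score adjacent word pairs by pointwise mutual information.

    Hypothetical helper for illustration only; the min_count threshold
    and the probability estimates are assumptions, not the study's setup.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # rare pairs give unreliable PMI estimates
        p_xy = count / n_bi
        p_x = unigrams[w1] / n_uni
        p_y = unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_xy / (p_x * p_y))
    return scores


def jaccard(a, b):
    """Overlap of two keyphrase sets: |A & B| / |A | B| (assumed metric)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    # Toy token sequence; a real run would use lemmatized lecture transcripts.
    tokens = "чёрная дыра поглощает свет потому что чёрная дыра массивна".split()
    automatic = {" ".join(pair) for pair in pmi_bigrams(tokens)}
    manual = {"чёрная дыра", "свет"}  # hypothetical participant choices
    print(automatic, jaccard(automatic, manual))
```

PMI favors word pairs that co-occur more often than their individual frequencies would predict, which is why a frequency threshold is commonly applied first. A set-overlap score such as the one above is only one possible way to quantify agreement between automatic and manual keyphrases.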