Experiments on automatic keyphrase extraction in a stylistically heterogeneous corpus of Russian texts
The paper describes an experimental study of automatic keyphrase extraction techniques evaluated against expert assessments. The purpose of the study is to confirm hypotheses on the location of keyphrases within a document and on how keyphrases differ depending on the algorithm applied and the text style. Experiments on automatic keyphrase selection are carried out using nine algorithms of various types: statistical (Log-Likelihood, TF-IDF, Chi-square), hybrid, also called linguostatistical (RAKE, YAKE, PullEnti, Topia), structural, also called graph-based (TextRank), and machine learning (KeyBERT). For the study, a mixed corpus of about 1 million tokens was prepared, comprising 50 social media texts (news reports with headlines), 50 scientific texts (articles on computational linguistics with titles, abstracts and manually specified sets of key expressions), and 50 literary texts (chapters from prose works, provided with the author’s description of the content). The evaluation procedure compares keyphrases selected by experts from the first segment of the texts with key expressions automatically extracted from the second segment. A quantitative assessment of the matches between expert and automatic markup confirmed the hypothesis that keyphrases are concentrated differently in the text segments being compared. The study of lexico-grammatical and semantic features of keyphrases revealed features that are determined by text style. The results of the study may improve semantic compression procedures that rely on automatic keyphrase extraction.
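To illustrate the kind of quantitative comparison between expert and automatic markup described above, the following is a minimal Python sketch. It is not the procedure used in the study: the abstract does not specify the matching criterion or the metric, so exact matching after lowercasing and whitespace normalization, as well as the use of precision, recall and F1, are assumptions made here. The function names `normalize` and `match_scores` and the sample phrases are hypothetical.

```python
def normalize(phrase: str) -> str:
    """Lowercase a phrase and collapse runs of whitespace (assumed normalization)."""
    return " ".join(phrase.lower().split())


def match_scores(expert: list[str], automatic: list[str]) -> dict:
    """Compare expert keyphrases with automatically extracted ones.

    Assumption: two phrases match only if they are identical after
    normalization. Precision, recall and F1 are illustrative metrics,
    not necessarily the ones reported in the paper.
    """
    expert_set = {normalize(p) for p in expert} - {""}
    auto_set = {normalize(p) for p in automatic} - {""}
    matched = expert_set & auto_set

    precision = len(matched) / len(auto_set) if auto_set else 0.0
    recall = len(matched) / len(expert_set) if expert_set else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0

    return {
        "matched": sorted(matched),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


if __name__ == "__main__":
    # Hypothetical sample data, not taken from the corpus.
    expert_phrases = ["Keyphrase extraction", "text style", "corpus"]
    automatic_phrases = ["keyphrase  extraction", "TextRank", "corpus", "graph"]
    print(match_scores(expert_phrases, automatic_phrases))
```

For Russian texts, exact string matching is likely too strict because of inflection; a realistic setup would probably lemmatize both phrase sets before comparison, which this sketch leaves out.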