Studying the impact of morphological parameters on text readability using statistical analysis methods


The paper addresses one of the important aspects of text complexity, namely the dependency of text readability on a set of morphological and text surface metrics such as the average length of words, sentences, etc. The correlation between the objective text complexity which is specified by quantitative parameters of the linguistic features and the subjective text complexity, i.e. the difficulty of text comprehension as a psychological phenomenon, is analyzed. To assess the morphological text complexity we used an annotated dataset consisting of 1000 online news texts (140000 tokens) retrieved from the websites of Russian universities. For each text unit the ratio of each part-of-speech per token is measured. Online news texts of the dataset were also assessed by a target audience of the website, i.e. applicants, undergraduate and postgraduate students. As a result, the dataset was automatically annotated based on text linguistic features and human-labelled based on experts’ estimates of text readability on a 5-point scale. To assess the significance of morphological metrics and their influence on text readability, the correlation and regression analysis was carried out. To automatically classify a text as ‘easy-to-read’ or not ‘easy-to-read’, both single feature and compound models including more than one metric were constructed. In agreement with the prior research the most common metrics influencing text readability appear to be text surface characteristics. However, the proposed models also made it possible to establish the significance of morphological parameters, used both in single feature and compound models, such as the use of participles, nouns in the genitive case, adjectives and numerals, which should be taken into account in analyzing news text readability. Moreover, novel formulae for assessing readability were proposed based on the studied coefficients.