Named Entities in the German-Language Press: Corpus and Expert Analysis

Authors:
Abstract:

The analysis of proper names mentioned in news texts is of particular research interest because it allows the topics covered in publications to be identified indirectly. This article presents an analysis of an automatic procedure for extracting named entities, using material from the German-language press. The study covers both national German publications aimed at a broad audience and regional and local newspapers aimed at a narrower audience in the federal states of Germany. The work proceeded in two stages. In the first stage, entities belonging to one of three categories (anthroponyms, ergonyms, and toponyms) were extracted with the Stanza tool from the texts of each publication and from the article collection as a whole. Semantic networks reflecting the relationships between these entities were then constructed for the 50 most frequent lexemes. In the second stage, these proper names were subjected to expert analysis and subsequent clustering, which made it possible, first, to identify additional themes that the automated procedure had not revealed and, second, to carry out an in-depth analysis. The results show that national press materials are dominated by themes brought into the media sphere that relate to the modern concept of political education, whereas regional and local publications concentrate largely on the local agenda. Automatic identification of named entities can be considered a necessary step for subsequent discourse analysis, although the resulting material requires additional expert evaluation.
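The first stage described above can be illustrated with a minimal sketch. It assumes that entities have already been extracted (for example by Stanza) as (name, label) pairs per article, that the labels PER, ORG, and LOC correspond to anthroponyms, ergonyms, and toponyms, and that a "relationship" between two entities means co-occurrence within the same article. None of these details is stated in the abstract, and the function names and toy data are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Assumed mapping from NER labels to the three categories named in the text.
CATEGORY = {"PER": "anthroponym", "ORG": "ergonym", "LOC": "toponym"}


def top_entities(docs, n=50):
    """Return the n most frequent entity names across all articles."""
    counts = Counter(
        name for doc in docs for name, label in doc if label in CATEGORY
    )
    return [name for name, _ in counts.most_common(n)]


def cooccurrence_network(docs, n=50):
    """Count how often pairs of the top-n entities share an article.

    Co-occurrence is an assumed stand-in for the semantic relationships
    mentioned in the abstract; edges are keyed by alphabetically sorted pairs.
    """
    keep = set(top_entities(docs, n))
    edges = Counter()
    for doc in docs:
        present = sorted(
            {name for name, label in doc if label in CATEGORY and name in keep}
        )
        for a, b in combinations(present, 2):
            edges[(a, b)] += 1
    return edges


# Invented toy data: one list of (name, label) pairs per article.
docs = [
    [("Berlin", "LOC"), ("Bundestag", "ORG"), ("Olaf Scholz", "PER")],
    [("Berlin", "LOC"), ("Bundestag", "ORG")],
    [("München", "LOC"), ("Berlin", "LOC"), ("Euro", "MISC")],
]

edges = cooccurrence_network(docs, n=50)
```

Running the same functions once per publication and once on the pooled collection would mirror the per-publication and whole-corpus extraction described above. The resulting edge weights could then be handed to a graph library for visualization before the expert review.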