Dataset preprocessing effects on Bi-LSTM-based concept tagging of text tokens

Authors:
Abstract:

The paper addresses preprocessing of natural language datasets as a means of improving neural network model performance. The aim of the study is to identify the preprocessing parameters that yield higher performance for a model that maps textual input (a sequence of lexical units) to semantic, or conceptual, classes, i.e. concept tagging. Our methodology includes: a) modeling the conceptual annotation of textual units, and b) experimenting with dataset preprocessing options. The proposed model takes as input lowercase tokens representing words and multi-component lexical units (phrases), some of which are related to domain concepts. Since each token may refer to several conceptual classes, concept tagging is treated as a multi-label classification problem. The study uses a corpus of English-language news reports on terrorist attacks. We experimented with preprocessing the corpus-based dataset by: a) lemmatizing tokens, b) removing stop words, and c) including sentence separators as individual tokens in the model vocabulary. The multi-label classification model used in the training experiments is a neural network that constructs sequences of lexical unit embeddings and feeds them into a bidirectional long short-term memory (Bi-LSTM) model. The experimental results show that the dataset to which all three preprocessing procedures were applied yielded the highest micro-, macro-, and weighted-averaged F1-scores. The per-class F1-score on the test dataset reaches 88% for the class characterized by high frequency and low lexical variability in the training, validation, and test samples. The novelty of the paper lies in the approach to content analysis of news reports on terrorist attacks based on the proposed multi-label classification model. New results were obtained by experimenting with differently preprocessed corpora of such reports. The proposed method may also be used for content analysis of news reports in other subject areas.
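
The three preprocessing options named in the abstract (lemmatization, stop word removal, sentence separators as tokens) can be illustrated with a minimal sketch. It is not the authors' pipeline: the lemma table, the stop word list, the `<sep>` token name, and the `preprocess` function are all hypothetical stand-ins chosen for illustration, and multi-component lexical units (phrases) are not handled here.

```python
import re

# Hypothetical toy resources. A real pipeline would use a full lemmatizer
# and stop word list; the abstract does not say which ones were used.
TOY_LEMMAS = {"attacks": "attack", "killed": "kill", "bombs": "bomb", "were": "be"}
TOY_STOP_WORDS = {"the", "a", "an", "in", "of", "be", "were", "was"}
SENT_SEP = "<sep>"  # assumed name for the sentence-separator token


def preprocess(text, lemmatize=True, remove_stop_words=True, keep_separators=True):
    """Return a flat list of lowercase tokens for one document."""
    tokens = []
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    for sentence in sentences:
        # Lowercase, then keep alphanumeric words (hyphenated words stay whole).
        for word in re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", sentence.lower()):
            if lemmatize:
                word = TOY_LEMMAS.get(word, word)
            if remove_stop_words and word in TOY_STOP_WORDS:
                continue
            tokens.append(word)
        if keep_separators:
            # The separator is an ordinary vocabulary item placed after each sentence.
            tokens.append(SENT_SEP)
    return tokens


sample = "The bombs killed a guard. Attacks were reported in the city."
all_options = preprocess(sample)
no_options = preprocess(sample, lemmatize=False, remove_stop_words=False,
                        keep_separators=False)
```

Toggling the three flags independently gives the eight dataset variants that such an experiment would compare; the resulting token sequences would then be mapped to embeddings before being passed to the Bi-LSTM.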