Danmaku Text Clustering Algorithm Based on Feature Extension and Word-Pair Filtering OBTM

keywords: Danmaku text, short text clustering, feature extension, OBTM, new word discovery
Danmaku text clustering is a hot topic in the analysis of online video reviews. Because danmaku comments are short and contain many new words, existing methods yield unsatisfactory clustering accuracy. To address this problem, a danmaku text clustering algorithm based on feature extension and word-pair filtering OBTM is proposed. First, a new-word discovery algorithm based on weight optimization is proposed to retain the features of new words in the danmaku text. Then, the internal information and external knowledge of the new words are used to extend the features of the danmaku text and thereby reduce feature sparsity. Furthermore, an OBTM topic model based on word-pair filtering is designed to eliminate noise features. Finally, a Single-Pass algorithm based on cluster center iteration is proposed to obtain the clustering results of the topic feature words. Experimental results show that the proposed algorithm achieves clustering accuracy 13.33 %, 8.52 %, and 6.25 % higher than the OBTM, Word2vec+BTM, and OurE.Drift* algorithms, respectively.
reference: Vol. 41, 2022, No. 3, pp. 788–812
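
The final step named in the abstract, Single-Pass clustering with an updated cluster center, can be illustrated with a minimal sketch. The abstract does not specify the paper's cluster center iteration, so the version below is only the generic textbook form: each vector joins the most similar existing cluster if the cosine similarity reaches a threshold, and the center of that cluster is then recomputed as a running mean. The function names `cosine` and `single_pass`, the `threshold` parameter, and its default value are assumptions made for illustration and are not taken from the paper.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def single_pass(vectors, threshold=0.8):
    """Generic Single-Pass clustering (hypothetical helper, not the paper's code).

    Returns (labels, centers), where labels[i] is the cluster index of
    vectors[i] and centers[k] is the running mean of cluster k.
    """
    centers = []  # current center of each cluster
    sizes = []    # number of members in each cluster
    labels = []
    for v in vectors:
        # Find the most similar existing cluster center.
        best, best_sim = -1, -1.0
        for i, c in enumerate(centers):
            sim = cosine(v, c)
            if sim > best_sim:
                best, best_sim = i, sim
        if best >= 0 and best_sim >= threshold:
            # Join the cluster and update its center as a running mean.
            n = sizes[best]
            centers[best] = [(c * n + x) / (n + 1) for c, x in zip(centers[best], v)]
            sizes[best] = n + 1
            labels.append(best)
        else:
            # No cluster is similar enough: start a new one.
            centers.append(list(v))
            sizes.append(1)
            labels.append(len(centers) - 1)
    return labels, centers


if __name__ == "__main__":
    # Toy 2-D "topic feature" vectors, invented for this example.
    demo = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
    labels, centers = single_pass(demo, threshold=0.8)
    print(labels)
```

On the toy input the first two vectors fall into one cluster and the last two into another. Because the method makes a single pass, the result depends on input order and on the chosen threshold, which is presumably one reason the paper iterates on the cluster centers.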