Supervised term-category feature weighting for improved text classification

Text classification is a central task in Natural Language Processing (NLP) that aims at categorizing text documents into predefined classes or categories. It requires appropriate features to describe the contents and meaning of text documents, and map them with their target categories. Existing text...

وصف كامل

محفوظ في:
التفاصيل البيبلوغرافية
المؤلف الرئيسي: Attieh, Joseph (author)
مؤلفون آخرون: Tekli, Joe (author)
التنسيق: article
منشور في: 2022
الوصول للمادة أونلاين:http://hdl.handle.net/10725/15996
https://doi.org/10.1016/j.knosys.2022.110215
http://libraries.lau.edu.lb/research/laur/terms-of-use/articles.php
https://www.sciencedirect.com/science/article/pii/S0950705122013119
الوسوم: إضافة وسم
لا توجد وسوم, كن أول من يضع وسما على هذه التسجيلة!
الوصف
الملخص:Text classification is a central task in Natural Language Processing (NLP) that aims at categorizing text documents into predefined classes or categories. It requires appropriate features to describe the contents and meaning of text documents, and map them with their target categories. Existing text feature representations rely on a weighted representation of the document terms. Hence, choosing a suitable method for term weighting is of major importance and can help increase the effectiveness of the classification task. In this study, we provide a novel text classification framework for Category-based Feature Engineering titled CFE. It consists of a supervised weighting scheme defined based on a variant of the TF-ICF (Term Frequency-Inverse Category Frequency) model, embedded into three new lean classification approaches: (i) IterativeAdditive (flat), (ii) GradientDescentANN (1-layered), and (iii) FeedForwardANN (2-layered). The IterativeAdditive approach augments each document representation with a set of synthetic features inferred from TF-ICF category representations. It builds a term-category TF-ICF matrix using an iterative and additive algorithm that produces category vector representations and updates until reaching convergence. GradientDescentANN replaces the iterative additive process mentioned previously by computing the term-category matrix using a gradient descent ANN model. Training the ANN using the gradient descent algorithm allows updating the term-category matrix until reaching convergence. FeedForwardANN uses a feed-forward ANN model to transform document representations into the category vector space. The transformed document vectors are then compared with the target category vectors, and are associated with the most similar categories. We have implemented CFE including its three classification approaches, and we have conducted a large battery of tests to evaluate their performance. Experimental results on five benchmark datasets show that our lean approaches mostly improve text classification accuracy while requiring significantly less computation time compared with their deep model alternatives.