Outline
- Abstract
- Keywords
- 1. Introduction
- 2. Related Work
- 2.1. Support Vector Machines for Classification Problem
- 2.2. Semantic Kernels for Text Classification
- 2.3. Term Weighting Methods
- 2.4. Helmholtz Principle from Gestalt Theory and Its Applications to Text Mining
- 3. Class Meanings Kernel (CMK)
- 4. Experiment Setup
- 5. Experimental Results and Discussion
- 6. Conclusions and Future Work
- Acknowledgment
- References
Abstract
Text categorization plays a crucial role in both academic and commercial platforms due to the growing demand for automatic organization of documents. Kernel-based classification algorithms such as Support Vector Machines (SVM) have become highly popular in text mining. This is mainly due to their relatively high classification accuracy in several application domains, as well as their ability to handle high-dimensional and sparse data, which are the challenging characteristics of textual data representation. Recently, there has been increased interest in exploiting background knowledge, such as ontologies and corpus-based statistical knowledge, in text categorization. It has been shown that replacing standard kernel functions such as the linear kernel with customized kernel functions that take advantage of this background knowledge can increase the performance of SVM in the text classification domain. Based on this, we propose a novel semantic smoothing kernel for SVM. The suggested approach is based on a meaning measure, which calculates the meaningfulness of terms in the context of classes. The document vectors are smoothed using these class-specific meaning values. Since the smoothing process makes use of class information, the kernel can be considered a supervised smoothing kernel. The meaning measure is based on the Helmholtz principle from Gestalt theory and has previously been applied to several text mining applications such as document summarization and feature extraction. However, to the best of our knowledge, ours is the first study to use the meaning measure in a supervised setting to build a semantic kernel for SVM. We evaluated the proposed approach by conducting a large number of experiments on well-known textual datasets and present results with respect to different experimental conditions. We compare our results with traditional kernels used in SVM, such as the linear kernel, as well as with several corpus-based semantic kernels. Our results show that the proposed approach outperforms the other kernels in classification performance.
Keywords: Higher-order relations - Meaning - Semantic kernel - Support vector machines - Text classification
6. Conclusions and Future Work
We introduce a new semantic kernel for SVM called the Class Meanings Kernel (CMK). The CMK is based on the meaning values of terms in the context of classes in the training set. The meaning values are calculated according to the Helmholtz principle, which comes from Gestalt theory and has previously been applied to several text mining problems, including document summarization and feature extraction (Balinsky et al., 2010, 2011a, 2011b, 2011c).
Gestalt theory points out that meaningful features and interesting events appear as large deviations from randomness. The meaning calculations attempt to define the meaningfulness of terms in text by using the human perceptual model of the Helmholtz principle from Gestalt theory. In the context of text mining, textual data consist of natural structures in the form of sentences, paragraphs, documents, topics and, in our case, classes of documents. In our semantic kernel setting, we use the Helmholtz principle to compute the meaning values of terms in the context of the classes in which these terms appear. We use these meaning values to smooth the document term vectors. As a result, our approach can be considered a supervised semantic smoothing kernel that makes use of class information. This is one of the important novelties of our approach, since previous studies of semantic smoothing kernels do not incorporate class-specific information.
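The smoothing idea summarized above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the meaning score follows the number-of-false-alarms (NFA) form commonly associated with the Helmholtz-principle line of work, meaning = -(1/m) log NFA with NFA = C(K, m) / N^(m-1); the names `meaning_matrix` and `cmk_gram` are hypothetical; and clipping negative scores to zero is a choice made here for simplicity. The exact definitions used for the CMK are those given in Section 3 and may differ.

```python
import numpy as np
from math import lgamma, log

def log_binom(n, k):
    # Natural log of the binomial coefficient C(n, k), via log-gamma.
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)

def meaning_matrix(X, y, n_classes):
    """Return a (terms x classes) matrix of class-based meaning scores.

    X: (docs x terms) array of raw term counts; y: integer class labels.
    Assumed score: meaning = -(1/m) * log(NFA), NFA = C(K, m) / N**(m - 1),
    where m = count of the term in the class, K = count in the whole corpus,
    N = corpus length / class length.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    class_counts = np.vstack([X[y == c].sum(axis=0) for c in range(n_classes)])
    total_counts = class_counts.sum(axis=0)
    corpus_len = class_counts.sum()
    M = np.zeros((X.shape[1], n_classes))
    for c in range(n_classes):
        class_len = class_counts[c].sum()
        if class_len == 0:
            continue
        N = corpus_len / class_len
        for t in range(X.shape[1]):
            m, K = int(class_counts[c, t]), int(total_counts[t])
            if m == 0:
                continue  # term never occurs in this class: score stays 0
            log_nfa = log_binom(K, m) - (m - 1) * log(N)
            # Assumption made here: negative scores are clipped to zero.
            M[t, c] = max(0.0, -log_nfa / m)
    return M

def cmk_gram(X_a, X_b, M):
    # Smooth both sets of document vectors with M, then take inner products.
    return (np.asarray(X_a, dtype=float) @ M) @ (np.asarray(X_b, dtype=float) @ M).T

# Toy corpus: term 0 is specific to class 0, term 1 to class 1, term 2 is shared.
X = np.array([[2, 0, 1], [3, 0, 1], [0, 2, 1], [0, 3, 1]])
y = np.array([0, 0, 1, 1])
M = meaning_matrix(X, y, n_classes=2)
K = cmk_gram(X, X, M)
```

Because the Gram matrix is an inner product of smoothed vectors, it is symmetric and positive semi-definite, so it could be passed to any SVM implementation that accepts a precomputed kernel (for example, scikit-learn's `SVC(kernel="precomputed")`).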
Our experimental results show the promise of the CMK as a semantic smoothing kernel for SVM in the text classification domain. In most of our experiments, the CMK performs better than kernels commonly used in the literature, such as the linear kernel, the polynomial kernel and RBF. The CMK also outperforms other corpus-based semantic kernels, such as IHOSK (Altınel et al., 2014a) and HOTK (Altınel et al., 2014b), on most of the datasets. Furthermore, the CMK forms a foundation that is open to several improvements. For instance, the CMK can easily be combined with other semantic kernels that smooth the document term vectors using term-to-term semantic relations, such as those based on WordNet or Wikipedia. As future work, we would like to analyze and shed light on how our approach implicitly captures semantic information in the context of a class when calculating the similarity between two documents. We also plan to implement different class-based document or term similarities in supervised classification and to further refine our approach.