CISUC

On Text-based Mining with Active Learning and Background Knowledge using SVM

Authors

Abstract

Text mining, intelligent text analysis, text data mining and knowledge discovery in text are commonly used names for the process of extracting relevant and non-trivial information from text.

Some crucial issues arise when trying to solve this problem, such as document representation and the scarcity of labeled data. This paper addresses these problems by introducing information from unlabeled documents into the training set, using the Support Vector Machine (SVM) separating margin as the differentiating factor.

Besides studying the influence of several pre-processing methods and drawing conclusions about their relative significance, we also evaluate the benefits of introducing background knowledge into an SVM text classifier. We further evaluate the possibility of active learning and propose a method for successfully combining background knowledge and active learning.

Experimental results show that the proposed techniques, used alone or combined, considerably improve classification performance, even when only small labeled training sets are available.
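
As an illustration only, the idea of using the SVM margin to decide how each unlabeled document is handled can be sketched as below. This is one plausible reading of the abstract, not the procedure from the paper: the function name split_by_margin, the threshold value, the number of queries and the example scores are all hypothetical, and the signed decision values are assumed to come from an already trained SVM.

```python
def split_by_margin(decision_values, confident_threshold=1.0, n_queries=2):
    """Partition unlabeled documents by their distance to the SVM hyperplane.

    decision_values: dict mapping a document id to the signed SVM output f(x)
    (assumed to be computed elsewhere by a trained classifier).

    Returns a pair (auto_labeled, to_query):
      auto_labeled: dict doc id -> predicted label (+1 or -1) for documents
                    with |f(x)| >= confident_threshold; these could be added
                    to the training set as background knowledge.
      to_query:     list of the n_queries remaining doc ids with the smallest
                    |f(x)|; these could be shown to a human for labeling
                    (active learning).
    """
    auto_labeled = {}
    uncertain = []
    for doc_id, value in decision_values.items():
        if abs(value) >= confident_threshold:
            # Far from the hyperplane: trust the classifier's own label.
            auto_labeled[doc_id] = 1 if value > 0 else -1
        else:
            # Inside the margin: candidate for a human label.
            uncertain.append((abs(value), doc_id))
    uncertain.sort()  # smallest absolute decision value first
    to_query = [doc_id for _, doc_id in uncertain[:n_queries]]
    return auto_labeled, to_query


# Hypothetical decision values for five unlabeled documents.
scores = {"d1": 1.8, "d2": -0.1, "d3": -2.3, "d4": 0.4, "d5": 0.9}
auto, query = split_by_margin(scores)
print(auto)   # documents labeled automatically
print(query)  # documents to send to a human annotator
```

In this sketch the two mechanisms share a single pass over the unlabeled pool, which is one simple way they might be combined; the threshold of 1.0 mirrors the canonical SVM margin but is an assumption, not a value taken from the paper.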

Subject

Text Mining, Partially Labeled Data

Related Project

CATCH - Inductive Inference for Large Scale Data Bases Text CATegorization

Journal

Journal of Soft Computing - A Fusion of Foundations, Methodologies and Applications, Springer Verlag, Vol. 11, #6, pp. 519-530, January 2007

Cited by

Year 2011 : 1 citation

 Polajnar, T., Rogers, S., Girolami, M.
Protein interaction detection in sentences via Gaussian processes: A preliminary evaluation
(2011) International Journal of Data Mining and Bioinformatics, 5 (1), pp. 52-72.

Year 2009 : 1 citation

 Polajnar, T., Rogers, S., Girolami, M.
Classification of protein interaction sentences via Gaussian processes
(2009) Pattern Recognition in Bioinformatics: 4th IAPR ?