International Journal of Innovative Research in Advanced Engineering (IJIRAE) ISSN: 2349-2163 Volume 1 Issue 8 (September 2014) _______________________________________________________________________________________________________ © 2014, IJIRAE- All Rights Reserved
HYBRIDIZATION OF EM AND SVM CLUSTERS FOR EFFICIENT TEXT CATEGORIZATION

S. Arul Murugan, HOD, Assistant Professor, Dept of Computer Applications, Saradha Gangadharan College, Puducherry. Research Scholar, Periyar University, Salem

Dr. P. Suresh, Head of the Department, Dept of Computer Science, Salem Sowdeswari College (Govt. Aided), Salem. Research Supervisor, Periyar University, Salem

Abstract: Text categorization is an active research area in information retrieval and machine learning. Text classification is the task of automatically assigning semantic categories to natural language texts, and it has become one of the key methods for organizing online information. The Fuzzy Self-Constructing Feature Clustering (FSFC) algorithm reduces the dimensionality of feature vectors using a membership function with statistical mean and deviation [1]. However, FSFC does not include lexica to make the preprocessing mechanism more effective, and it does not categorize multi-label results. A Support Vector Machine (SVM) cluster and the Expectation Maximization (EM) algorithm are introduced to solve these problems. Hybridization of the EM algorithm and the SVM cluster combines their classification power to produce multi-label categorization results while removing noise effectively. Initially, the EM algorithm extracts the potentially noisy articles from the data set using the descending porthole technique. Descending porthole is a sliding window technique applied from the top to the bottom of the article for preprocessing. Subsequently, the SVM cluster applies the content holdup method, which generates a more efficient multi-label representation of the articles.
Hybridization of the EM algorithm and the SVM cluster outperforms the Fuzzy Self-Constructing Feature Clustering algorithm in terms of lexica inclusion and multi-label categorization of text results. The experimental performance of the hybrid method is evaluated on the Dexter Data Set from the UCI repository against the existing FSFC, and it attains lower execution time, better clustering efficiency, and a higher net similarity score level in texts.

Keywords: Expectation Maximization, Support Vector Machine, Multi-label Representation, Feature Clustering, Fuzzy Self Constructing, Preprocessing, Membership Function

I. INTRODUCTION

Clustering is one of the traditional data mining techniques; clustering methods attempt to identify intrinsic groupings of text articles. A set of clusters is produced in which the clusters exhibit high intra-cluster similarity and low inter-cluster similarity. Clustering is used for finding patterns in unlabelled data with many dimensions, and it has attracted interest from researchers in data mining and text categorization. The main advantage of a clustering algorithm is its ability to learn from and detect similar data without explicit labels.

Text categorization is a primary task in information retrieval, and a rich body of information about it has accumulated. The usual approach to text categorization has so far been to use a document representation in a word-based input space, that is, a vector in some high-dimensional Euclidean space, and then to rely on one of several classification algorithms trained in a supervised manner. Since the early days of text categorization, the theory and practice of classifier design have advanced considerably and numerous strong learning algorithms have emerged. In contrast, despite numerous attempts to introduce more sophisticated document representation techniques, e.g. those based on higher-order word statistics, the simple independent word-based representation, as found in the Dexter Data Set from the UCI repository, remains very popular. Indeed, to date the best multi-class, multi-label categorization results are based on the Dexter Data Set.

A text classification task consists of the training phase and the text categorization phase. The former includes the feature extraction procedure and the indexing process. The vector space model has been used as a conventional method for text representation. The model represents a document as a vector of features using Term Frequency (TF) and Inverse Document Frequency (IDF). The model simply counts TF without considering where the term occurs. However, each sentence in an article has a different importance for identifying the content of the document. Thus, better results can be achieved by assigning each term a different weight according to the importance of the sentence in which it occurs. To address this problem, terms are weighted differently according to their location, so that the structural information of a document is applied to the term weights. But the FSFC method supposes that only several sentences, located at the front or the rear of an article, carry the significant meaning. Hence it can be applied only to documents with a fixed form, such as articles. The next step uses the title of an article in order to choose the important terms, and the terms in the title are treated as important. A drawback is that some titles do not properly convey the meaning of the article and instead increase its ambiguity. This case often appears in documents with a familiar style, such as Newsgroup and Email.

Normally, text document clustering methods attempt to separate the documents into groups, where each group represents a theme that differs from the themes represented by the other groups. Text classification aims at assigning class labels to text records.
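The TF and IDF weighting of the vector space model described in this section can be illustrated with a minimal sketch. It assumes raw term counts for TF and a plain logarithmic IDF, which is only one of several common variants; the function name `tf_idf` and the toy documents are hypothetical and do not come from the paper.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts; term position is ignored
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

# Toy, made-up documents (already tokenized).
docs = [["text", "cluster", "text"], ["cluster", "svm"], ["noise", "text"]]
vectors = tf_idf(docs)
```

Because the counts ignore where a term occurs, a term in a title and the same term in a closing sentence receive identical weight, which is the limitation the paragraph above points out.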
Text classification based on multi-words with a support vector machine has been investigated for its beneficial effects [7], but it captures only the appropriate information. A multi-word extraction method based on the syntactical rules of multi-words does not integrate the learning method with the characteristics of the document vectors.

Noisy documents are one of the foremost causes of reduced performance in binary text classification, and classifiers need to handle these noisy documents efficiently to attain high performance. This work develops an EM algorithm using the descending porthole technique, which is used to handle these noisy documents efficiently and achieve high performance. In contrast, the numerous greedy approaches to feature selection consider each feature individually, with single-level label extraction. To overcome that problem, the SVM cluster uses the content holdup method to provide a good solution to the statistical problem in data mining.

II. STATE OF ART

A clustering algorithm has been proposed that extends affinity propagation with a novel asymmetric similarity measure that captures the structural information of texts. It is a semi-supervised learning approach that develops knowledge from a small quantity of labeled objects, but a generic seeds construction strategy is not developed [2]. The Concept-Based Mining Model efficiently identifies the important matching concepts among articles according to the semantics of their sentences. The similarity between documents is calculated based on a new concept-based relationship [4].
Automatic Text Categorization (ATC) has been studied from a communication system perspective, with feature space dimensionality reduction undertaken by a two-level supervised scheme. The communication-theoretic modeling aspect, with special stress on the synthesis of prototype documents via the generative model, always depends on the optimal design of the document coding [6]. Multidimensional Scaling (MDS) combined with Procrustes CCA (Canonical Correlation Analysis) and JOFC (Joint Optimization of Fidelity and Commensurability) was developed for a particular text document classification application [8]. The Subspace Decision Cluster Classification (SDCC) model consists of a set of disjoint subspace decision clusters, each labeled with a dominant class that determines the class of new objects falling in the cluster. A cluster tree is first generated from a training data set by recursively calling a subspace clustering algorithm, the Entropy Weighting k-Means algorithm [9]. Learning methods discover the underlying relations between images and texts based on small training data sets [10]. A similarity-based multilingual retrieval paradigm using advanced similarity calculation methods fails to result in better performance.

A sequence classification model is defined based on a set of sequential patterns and two sets of weights, one for the patterns and one for the classes. The use of different scoring functions and other machine learning approaches, such as linear models or neural networks, for identifying the optimal weight values is not addressed. Finally, the extension of the methodology to handle time series through discretization techniques is not examined [3].
The above issues can be treated by employing a pattern reduction and selection algorithm. The novel class detection problem becomes more difficult in the presence of concept drift, when the underlying data distributions evolve in streams. The classification model occasionally needs to remain in place, and the data stream classification problem under active feature sets is not addressed [14]. An objective function is constructed by jointly combining the global loss of the local spline regressions and the squared errors of the class labels of the labeled data [11]. A transductive classification algorithm is introduced in which a globally optimal classification is finally obtained, but an algorithm for image segmentation and image matting is not developed. The Feature Relation Network (FRN) considers semantic information and also leverages the syntactic relationships between n-gram features, but it is not appropriate for other text classification problems [15].

The core mechanism in [5] does not add a more complex algorithm for the creation of the summaries, in order to avoid an overly complex system that requires long execution times. Additionally, although balancing factors were used, longer sentences still gained more weight than shorter ones. This implies that several short but comprehensive sentences may be omitted. An ML-based methodology for building an application capable of identifying and disseminating healthcare information [12] fails to extend the experimental methodology, and its focus is not on integrating the research discoveries into a framework for consumers. A personalized ontology model is represented over user profiles, but it fails to generate user local instance repositories to match the representation of a global knowledge base.
The current system assumes that all user local instance repositories have content-based descriptors referring to the subjects; however, a large volume of documents existing on the web may not have sufficient content-based descriptors [13]. We now discuss how to make the preprocessing mechanism more effective and how to categorize the multi-label results in text categorization. In summary, our contributions are:

(i) The EM algorithm removes the potentially noisy articles from the Dexter dataset using the descending porthole technique.
(ii) After EM-based preprocessing, the SVM cluster generates a more efficient multi-label representation of the articles.
(iii) Multi-label categorization of text results is achieved on the Dexter data.
(iv) Finally, hybridization of the EM algorithm and the SVM cluster improves the similarity score level of text in articles.

III. METHODOLOGY

Noisy articles are usually located in a particular area of a dataset, and the descending porthole technique is employed to detect this noisy area effectively. By estimating the entropy of the mixed articles in a descending porthole, the noisy areas are found, and all the articles in such an area are regarded as unlabeled data. The EM algorithm handles unlabeled articles efficiently by providing a way to extract and remove the noisy articles from the unlabeled data. The subsequent process is clustering with a more sophisticated text categorization method, the Content Holdup (CH) method. The CH approach represents articles in a feature cluster space, where each cluster is a distribution over article classes. The Support Vector Machine (SVM) cluster allows for the best reported result for multi-level categorization of the Dexter dataset.
SVM with the content holdup method generates a more efficient multi-label representation of the articles. Hybridization of the EM algorithm and the SVM cluster for efficient text categorization consists of two modules, namely the instruction module and the text categorization module. The architecture diagram of the hybridization is shown in Fig 3.1.

Fig 3.1 Hybridization of EM algorithm and SVM cluster Process

Fig 3.1 demonstrates the hybridization of the EM algorithm and the SVM cluster on the Dexter dataset. The instruction module holds the list of processes undergone, whereas the text categorization module holds the processing technique. The EM algorithm and the SVM cluster are combined in order to make the preprocessing mechanism more effective and to categorize the multi-label results.

3.1 Expectation Maximization for preprocessing

The proposed Expectation Maximization (EM) approach consists of the following four steps: the one adjacent to the rest method, calculating prediction scores to remove noise, calculating entropy using the descending porthole technique, and the EM algorithm. In the one adjacent to the rest method, the articles of one group are regarded as positive examples and the documents of the other categories as negative examples. In order to set up the training data for binary classification from the Dexter Data Set, the multi-class setting is converted into the binary setting using the one adjacent to the rest method.
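The conversion from the multi-class setting to the binary setting can be sketched as follows. This assumes that the "one adjacent to the rest" method corresponds to the usual one-versus-rest reduction; the function name `one_vs_rest`, the article identifiers, and the category names are hypothetical placeholders.

```python
def one_vs_rest(articles, target):
    """Split (text, category) pairs into positive and negative examples."""
    positive = [text for text, cat in articles if cat == target]
    negative = [text for text, cat in articles if cat != target]
    return positive, negative

# Made-up labelled articles; the categories are illustrative only.
articles = [("a1", "sports"), ("a2", "politics"),
            ("a3", "sports"), ("a4", "science")]

# One binary task per category: that category against all the others.
categories = {cat for _, cat in articles}
binary_tasks = {cat: one_vs_rest(articles, cat) for cat in categories}
```

Each category thus gets its own positive and negative set, and a separate binary classifier can be trained on each pair.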
[Fig 3.1 block labels: Instruction Module; Text Categorization Module; Instructed Data; Data from Dexter Dataset; Preprocessing; EM Algorithm; Clustering; SVM Clustering Technique; Indexing; Multi-level Text Categorization]

The goal is to discover an edge area, that is, an area containing many noisy articles. First, using the positive data set and the negative data set for each category from the one adjacent to the rest method, a Naive Bayes (NB) classifier is learned and a prediction score is obtained for each document by the following formula.

………. Eqn (1)

Here the terms of Eqn (1) denote a group, an article of that group, the probability of the article being positive in the group, and the probability of the article being negative in the group. According to these prediction scores, all the articles of each group are sorted in descending order. The probabilities are generally calculated as follows.

………… Eqn (2)
……….. Eqn (3)
……….. Eqn (4)

Here the terms denote the ith word in the vocabulary, the size of the vocabulary V, and the frequency of a word in an article.

In the EM method, an edge is detected in the block with the highest degree of mixing of positive and negative articles. The descending porthole technique is first used to detect this block. A porthole of a certain size descends from the top article to the last article in a list ordered by the prediction scores. An entropy value is calculated to estimate the degree of mixing in each porthole, so that noise can be removed for effective preprocessing:

$\mathrm{Entropy}(P) = -q_{+}\log_2 q_{+} - q_{-}\log_2 q_{-}$ ………… Eqn (5)

where, given a porthole P, $q_{+}$ is the proportion of positive articles in P and $q_{-}$ is the proportion of negative articles in P.
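The porthole scan can be illustrated with a small sketch. It assumes the standard two-class entropy with a base-2 logarithm and a list of labels that has already been sorted by descending prediction score; the function names and the label sequence are made up for illustration.

```python
import math

def entropy(labels):
    """Two-class entropy (base 2, an assumption) of a window of +1 / -1 labels."""
    q_pos = sum(1 for y in labels if y > 0) / len(labels)
    q_neg = 1.0 - q_pos
    # Terms with zero proportion contribute nothing and are skipped.
    return -sum(q * math.log2(q) for q in (q_pos, q_neg) if q > 0)

def window_entropies(sorted_labels, size):
    """Entropy of each window as it slides from the top of the list to the bottom."""
    return [entropy(sorted_labels[i:i + size])
            for i in range(len(sorted_labels) - size + 1)]

# Hypothetical labels of articles sorted by descending prediction score.
labels = [1, 1, 1, 1, 1, -1, 1, -1, -1, -1, -1]
scores = window_entropies(labels, size=5)
```

Under these assumptions a window with three positive and two negative articles has an entropy of about 0.971, while a window containing only one class has an entropy of 0, so the most mixed windows stand out near the boundary between the two classes.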
For example, if a porthole of five articles has three positive articles and two negative articles, the proportions of positive and negative articles are 3/5 and 2/5 respectively, and the entropy value of the porthole is calculated from these five articles. The two portholes with the highest entropy value are picked: one is the first detected from the top and the other is the first detected from the bottom. If there is no porthole or only one porthole with the highest entropy value, the portholes with the next highest entropy value become the selected windows. The maximum (max) and minimum (min) threshold values are then searched in the selected windows, respectively. The max threshold value is the highest prediction score of a negative article in the former window, and the min threshold value is the lowest prediction score of a positive article in the latter porthole. The max and min threshold values yield three classes of training articles: definitely positive articles, unlabeled articles, and definitely negative articles. By applying the EM algorithm to these three data sets, the actual noisy articles are extracted and removed. The EM algorithm is used to pick out noisy articles from the unlabeled articles for effective preprocessing. The general EM algorithm consists of two steps; it initially trains a classifier using the available labelled articles and labels the unlabeled articles by rigid (hard) classification.

//EM algorithm
Begin
Step 1: Each article with P (positive data) and N (negative data) are assigned
Step 2: is article of P, is article of N, is unlabeled data.
Step 3: P' {}, N' {} uses the current NB classifier using adjacent to the rest method
Step 4: For each article,
Step 5: If
Step 6: Continue
Step 7: Else
End
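The labelling loop outlined in the steps above can be sketched as a hard-assignment EM around a Naive Bayes classifier. This is only an illustration under stated assumptions: a multinomial Naive Bayes with add-one smoothing stands in for the paper's probability estimates, the loop stops when the labels no longer change, and all function names and toy articles are hypothetical. The rule that finally marks an article as noisy is left out, since the steps above do not state it.

```python
import math
from collections import Counter

def train_nb(pos_docs, neg_docs, vocab):
    """Multinomial Naive Bayes with add-one smoothing (an assumed variant)."""
    model = {}
    total = len(pos_docs) + len(neg_docs)
    for label, docs in (("P", pos_docs), ("N", neg_docs)):
        counts = Counter(w for d in docs for w in d)
        denom = sum(counts.values()) + len(vocab)
        log_prior = math.log((len(docs) + 1) / (total + 2))
        log_like = {w: math.log((counts[w] + 1) / denom) for w in vocab}
        model[label] = (log_prior, log_like)
    return model

def classify(model, doc):
    """Return the label with the higher log posterior (hard classification)."""
    def score(label):
        log_prior, log_like = model[label]
        return log_prior + sum(log_like[w] for w in doc if w in log_like)
    return "P" if score("P") >= score("N") else "N"

def hard_em(pos, neg, unlabeled, max_iter=10):
    """Alternate hard labelling of the unlabeled articles and retraining."""
    vocab = {w for d in pos + neg + unlabeled for w in d}
    model = train_nb(pos, neg, vocab)  # start from the labelled articles only
    labels = None
    for _ in range(max_iter):
        new_labels = [classify(model, d) for d in unlabeled]
        if new_labels == labels:  # labels are stable, so stop iterating
            break
        labels = new_labels
        extra_pos = [d for d, y in zip(unlabeled, labels) if y == "P"]
        extra_neg = [d for d, y in zip(unlabeled, labels) if y == "N"]
        model = train_nb(pos + extra_pos, neg + extra_neg, vocab)
    return labels

# Made-up tokenized articles for the three data sets.
pos = [["goal", "match"], ["match", "team"]]
neg = [["vote", "law"], ["law", "court"]]
unlabeled = [["team", "goal"], ["court", "vote"]]
labels = hard_em(pos, neg, unlabeled)
```

Here the definitely positive and definitely negative articles play the role of the labelled data, and the articles between the two thresholds are the unlabeled data that the loop relabels.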


Jul 23, 2017