
IJIRAE:: CLASSIFICATION PROBLEM IN TEXT MINING

International Journal of Innovative Research in Advanced Engineering (IJIRAE), ISSN: 2349-2163
Volume 1, Issue 8 (September 2014), www.ijirae.com
© 2014, IJIRAE - All Rights Reserved. Page - 333

CLASSIFICATION PROBLEM IN TEXT MINING

Jijy George, Sandhya .N., Suja George
Assistant Professors, Department of Computer Science, St. Joseph’s College (Autonomous), Langford Road, Shanthinagar, Bangalore - 560027, India.

Abstract: Data mining is the process of extracting previously unknown, potentially useful information by applying intelligent techniques. It is an analysis task which derives useful patterns and trends from a data repository, and it is an important step in the overall knowledge discovery process. Text mining is the analysis of data contained in natural language text. Classification is a basic functionality in mining, with wide applications in data mining, machine learning, databases, and information retrieval. These applications include target marketing, medical diagnosis, news group filtering, and document organization. In this paper we explore the use of classification techniques in the field of text mining and present a comparative study of a wide variety of text classification algorithms. We explore the application of classification in text mining in areas such as E-mail spam filtering, opinion mining, and text filtering of news articles. The algorithms considered include Decision Tree, Bayesian classification, and neural classification.
Keywords: Data Mining, Text Mining, Text Classification, E-mail spam filter, Digital Libraries, Opinion mining, Support Vector Machine.

I.   Introduction

Data mining is the process of extracting information from a data set and transforming it into an understandable form for further use. The data mining task is the automatic or semi-automatic analysis of large quantities of data to extract previously unknown, interesting patterns such as groups of data records (cluster analysis), unusual records (anomaly detection), and dependencies (association rule mining) [1][2].

Text mining, also referred to as text data mining, is the process of deriving high-quality information from text. This high-quality information is derived by finding patterns and trends through methods such as statistical pattern learning [3]. Text mining usually involves structuring the input text (usually by parsing, adding some derived linguistic features, removing others, and finally inserting the result into a database), then deriving patterns from the structured data, and finally evaluating and interpreting those patterns. 'High quality' in text mining usually refers to some combination of relevance, novelty, and interestingness.

The problem of classification is defined as follows. We have a set of training records $D = \{X_1, \ldots, X_N\}$, such that each record is labeled with a class value drawn from a set of $k$ different discrete values indexed by $\{1, \ldots, k\}$. The training data is used to construct a classification model, which relates the features in the underlying record to one of the class labels. For a given test instance for which the class is unknown, the training model is used to predict a class label for this instance [4]. In the hard version of the classification problem, a particular label is explicitly assigned to the instance, whereas in the soft version of the classification problem, a probability value is assigned to the test instance.
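The difference between the hard and soft versions can be illustrated with a minimal Python sketch. It is not a method from this paper: the word-count scoring rule, the function names (`train`, `predict_soft`, `predict_hard`), and the toy training records are all assumptions made only to show the two kinds of output.

```python
from collections import Counter, defaultdict


def train(records):
    """Build a toy model: per-class counts of the words seen in training.

    records is a list of (tokens, label) pairs, i.e. the labeled set D.
    """
    counts = defaultdict(Counter)
    for tokens, label in records:
        counts[label].update(tokens)
    return counts


def predict_soft(model, tokens):
    """Soft version: return a probability value for every class label."""
    # Assumed scoring rule: 1 + how often the test tokens occurred in the
    # training records of that class; scores are normalized to sum to 1.
    scores = {
        label: 1 + sum(word_counts[t] for t in tokens)
        for label, word_counts in model.items()
    }
    total = sum(scores.values())
    return {label: score / total for label, score in scores.items()}


def predict_hard(model, tokens):
    """Hard version: explicitly assign the single most probable label."""
    probabilities = predict_soft(model, tokens)
    return max(probabilities, key=probabilities.get)


# Hypothetical training records with k = 2 class labels.
D = [
    (["match", "goal", "team"], "sports"),
    (["election", "vote", "party"], "politics"),
    (["team", "coach", "season"], "sports"),
]
model = train(D)
soft = predict_soft(model, ["team", "goal"])
hard = predict_hard(model, ["team", "goal"])
```

Here `soft` maps each class label to a probability, while `hard` is the single label with the largest probability. Ranking the class choices, one of the variations mentioned next, would amount to sorting the entries of `soft`.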
Other variations of the classification problem allow ranking of different class choices for a test instance, or allow the assignment of multiple labels to a test instance. The problem of classification finds applications in different areas in the field of text mining. Some examples of areas in which classification is generally used are the following:

Text Filtering of News Articles: Most news services today are electronic, and organizations create a large volume of news articles every single day. In such cases, it is difficult to organize the news articles manually. Automated methods can therefore be very useful for news categorization in a variety of web portals. This application is also referred to as text filtering.

Digital Libraries: Document organization is useful in many applications beyond news filtering. A variety of supervised methods may be used for document organization in many domains, including large digital libraries of documents, web collections, and scientific literature. Hierarchically organized document collections can be particularly useful for browsing and retrieval.

Opinion Mining: Sentiment analysis and opinion mining is the field of study that analyzes people's opinions, sentiments, evaluations, attitudes, and emotions from written language. It is one of the most active research areas in natural language processing and is also widely studied in data mining, Web mining, and text mining.
Customer reviews or opinions are often short text documents which can be mined to extract useful information from the review.

Email Classification and Spam Filtering: It is often desirable to classify email in order to determine either the subject or to identify junk email in an automated way. This is also referred to as spam filtering or email filtering.

II.   Classification

“How does classification work?” Data classification [4] is a two-step process. In the first step, a classifier is built describing a predetermined set of data classes or concepts. This is the learning step (or training phase), where a classification algorithm builds the classifier by analyzing or “learning from” a training set made up of database tuples and their associated class labels. A tuple, $X$, is represented by an $n$-dimensional attribute vector, $X = (x_1, x_2, \ldots, x_n)$, depicting $n$ measurements made on the tuple from $n$ database attributes, respectively, $A_1, A_2, \ldots, A_n$. Each tuple, $X$, is assumed to belong to a predefined class as determined by another database attribute called the class label attribute. The class label attribute is discrete-valued and unordered. It is categorical in that each value serves as a category or class. The individual tuples making up the training set are referred to as training tuples and are selected from the database under analysis. In the context of classification, data tuples can be referred to as samples, examples, instances, data points, or objects. Because the class label of each training tuple is provided, this step is also known as supervised learning [4].

Figure 1

III.   Classification Algorithms

A wide variety of classification algorithms are used in the area of text mining. Here we discuss Decision Tree classification, Rule Based classification, Bayesian classification, and Neural Network classification.
The application of these algorithms in real-world text mining tasks such as Email Classification and Spam Filtering, Text Filtering of News Articles, and Opinion Mining is also discussed.

Decision Tree Classification: Decision trees are designed with the use of a hierarchical division of the underlying data space with the use of different text features [5]. The hierarchical division of the data space is designed to create class partitions which are more skewed in terms of their class distribution. For a given text instance, we determine the partition that it is most likely to belong to, and use it for the purposes of classification. Decision tree induction is the learning of decision trees from class-labeled training tuples [4]. A decision tree is a flowchart-like tree structure, where each internal node (non-leaf node) denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node (or terminal node) holds a class label. The topmost node in a tree is the root node.

“How are decision trees used for classification?” Given a tuple, $X$, for which the associated class label is unknown, the attribute values of the tuple are tested against the decision tree. A path is traced from the root to a leaf node, which holds the class prediction for that tuple. Decision trees can easily be converted to classification rules.

Figure 2

Pattern (Rule)-based Classifiers: In rule-based classifiers [5] we determine the word patterns which are most likely to be related to the different classes.
We construct a set of rules, in which the left-hand side corresponds to a word pattern, and the right-hand side corresponds to a class label [5]. These rules are used for the purposes of classification.

Neural Network Classifiers: Neural networks are used in a wide variety of domains for the purposes of classification. In the context of text data, the main difference for neural network classifiers [5] is that they are adapted to use word features. We note that neural network classifiers are related to SVM classifiers; indeed, both belong to the category of discriminative classifiers, in contrast with generative classifiers.

Bayesian (Generative) Classifiers: In Bayesian classifiers (also called generative classifiers) [5], we attempt to build a probabilistic classifier based on modeling the underlying word features in different classes. The idea is then to classify text based on the posterior probability of the documents belonging to the different classes on the basis of the word presence in the documents.

Other Classifiers: Almost all classifiers can be applied to the case of text analysis. Other classifiers include nearest neighbor classifiers and genetic algorithm-based classifiers.

IV.   Area of study

A.   Email spam filtering

As the Internet grows at an exceptional rate, electronic mail (abbreviated as E-mail) has become a widely used electronic form of communication on the Internet. Every day, millions of people exchange messages in this fast and cheap way. As the popularity of electronic commerce increases, the usage of E-mail will increase even more sharply. However, the benefits of E-mail also lead companies, organizations, and individuals to overuse it to promote products and spread information that serves their own purposes. The mailbox of a user may often be crowded with E-mail messages, some or even a large portion of which are uninteresting to her or him.
Looking for interesting messages every day is becoming tiresome and frustrating. As a result, a personal E-mail filter is needed. The process of building an E-mail filter can be put into the framework of text classification [6]. An E-mail message is viewed as a document, and a decision of interesting or not is viewed as a class label given to the E-mail document. While text classification has been well explored and various techniques have been reported, empirical study of E-mail as a document type, and of the features needed to build an effective personal E-mail filter in the framework of text classification, remains modest.

B.   Bayesian spam filtering

Bayesian spam filtering [1][6] is a statistical technique of e-mail filtering. In its basic form, it makes use of a naive Bayes classifier on bag-of-words features to identify spam e-mail, an approach commonly used in text classification. Naive Bayes classifiers work by correlating the use of tokens (typically words, or sometimes other things) with spam and non-spam e-mails and then using Bayesian inference to calculate a probability that an email is or is not spam. Naive Bayes spam filtering is a baseline technique for dealing with spam that can tailor itself to the email needs of individual users and give low false positive spam detection rates that are generally acceptable to users. It is one of the oldest ways of doing spam filtering, with roots in the 1990s.

The process behind this technique is as follows. Particular words have particular probabilities of occurrence in spam email and in legitimate email. For example, most email users will frequently see the word Viagra in spam email, but will rarely see it in legitimate email. The filter doesn't know these probabilities in advance, and must first be trained so it can build them up. To train the filter, the user must manually indicate whether a new email is spam or not. For all words in each training email, the filter will calculate the probabilities that each word will appear in spam or legitimate email in its database. For example, Bayesian spam filters will typically have learned a very high spam probability for the words Viagra and refinance, but a very low spam probability for words seen only in legitimate email, such as the names of friends and family members.

After training, the word probabilities (also known as likelihood functions) are used to compute the probability that an email with a particular set of words in it belongs to either category. Each word in the email, or only the most interesting words, contributes to the email's spam probability. This contribution is called the posterior probability and is computed using Bayes' theorem. Then, the email's spam probability is computed over all words in the email, and if the total exceeds a certain threshold (say 95%), the filter will mark the email as spam.

As in any other spam filtering technique, email marked as spam can then be automatically moved to a Junk email folder, or even deleted outright. Some software implements quarantine mechanisms that define a time frame during which the user is allowed to review the software's decision. The initial training can usually be refined when wrong judgements from the software are identified (false positives or false negatives). That allows the software to dynamically adapt to the ever-evolving nature of spam.
Some spam filters combine the results of both Bayesian spam filtering and other heuristics (pre-defined rules about the contents, looking at the message's envelope, etc.), resulting in even higher filtering accuracy, sometimes at the cost of adaptiveness.

C.   Text Filtering of News Articles

In many real-world scenarios, the ability to automatically classify documents into a fixed set of categories is highly desirable. Common scenarios include classifying a large amount of unclassified archival documents such as newspaper articles, legal records, and academic papers. For example, newspaper articles can be classified as ’features’, ’sports’ or ’news’. Other scenarios involve classifying documents as they are created. Examples include classifying movie review articles into ’positive’ or ’negative’ reviews, or classifying blog entries using a fixed set of labels.

V.   Classification algorithms for classification of News Articles

A.   Named Entities

Named entities [8] can be used as features for the classification of news articles by topic. The classification of news articles poses a considerable challenge to newspapers interested in identifying the interests of their users.
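As a rough illustration of named entities as features, the Python sketch below represents each article by its set of named entities and assigns a new article the topic of the most similar labeled article. Everything in it is an assumption made for the example: the entities are taken as already extracted by some recognizer, the topic labels and entity sets are invented, and the 1-nearest-neighbor rule with Jaccard similarity is just one simple choice among the classifiers mentioned earlier.

```python
def jaccard(a, b):
    """Jaccard similarity between two sets of entity strings."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def classify_by_entities(labeled_articles, entities):
    """Return the topic of the labeled article with the most similar entity set."""
    best_topic, best_similarity = None, -1.0
    for known_entities, topic in labeled_articles:
        similarity = jaccard(known_entities, entities)
        if similarity > best_similarity:
            best_topic, best_similarity = topic, similarity
    return best_topic


# Hypothetical labeled articles: (set of named entities, topic).
labeled_articles = [
    ({"FIFA", "Lionel Messi", "Barcelona"}, "sports"),
    ({"Reserve Bank of India", "Mumbai", "Sensex"}, "business"),
    ({"Parliament", "New Delhi", "Lok Sabha"}, "politics"),
]

# Entities assumed to have been extracted from a new, unlabeled article.
topic = classify_by_entities(labeled_articles, {"Lionel Messi", "FIFA", "Argentina"})
```

In this toy data the new article shares two of its entities with the first labeled article and none with the others, so `topic` is the label of that first article.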