Outline
- Abstract
- Index Terms
- I. Introduction
- II. Background Study
- A. Data Mining
- B. Algorithm Used in This Study
- 2.2.1 Support Vector Machine
- 2.2.2 Naïve Bayes
- 2.2.3 Decision Tree
- 2.2.4 Feature Selection
- 2.2.5 Classification and Prediction
- III. Related Work
- IV. Proposed Work
- V. Experiment and Analysis
- A. Experiment I
- B. Experiment II
- VI. Conclusion
- References
Abstract
As the web expands and people increasingly rely on it for communication, e-mail has become one of the fastest ways to send information from one place to another. Nowadays much communication, whether personal or business, takes place through e-mail. E-mail is an effective communication tool because it saves a great deal of time and cost. However, e-mail is also exposed to attacks, among them spam. Spam is the use of electronic messaging systems to send unsolicited bulk messages: it floods the Internet with many copies of the same message in an attempt to force that message on people who would not otherwise choose to receive it. In this study, we apply several data mining approaches to a spam dataset in order to find the best classifier for e-mail classification. We analyze the performance of various classifiers both with and without a feature selection algorithm. First, we experiment on the entire dataset without selecting features, applying the classifiers one by one and recording the results. Then we apply the Best-First feature selection algorithm to select the desired features and apply the same classifiers again. We find that accuracy improves when feature selection is embedded in the experiment. Random Tree proved to be the best classifier for spam mail classification, with an accuracy of 99.72%. None of the algorithms achieves 100% accuracy in classifying spam e-mails, but Random Tree comes very close.
Keywords: Classifier - E-mails - Feature Selection - Spam Mails
Conclusions
Spam mail classification is a major area of concern these days, as it helps in detecting unwanted e-mails and threats. Many researchers are therefore working in this area to find the best classifier for detecting spam, since a high-accuracy filter is required to remove unwanted mails. In this paper we focused on finding the best classifier for spam mail classification using data mining techniques. We applied various classification algorithms to the input data set and compared the results. The study shows that classifiers work better when a feature selection step is embedded in the classification process: accuracy improved markedly when the classifiers were applied to the reduced data set instead of the entire data set. The results were promising. The accuracy of the Random Tree classifier is 99.715% with the Best-First feature selection algorithm, and only 90.93% without this subset selection algorithm. The highest accuracy achieved in this study is therefore 99.715%. It is very difficult to achieve 100% accuracy, but Random Tree and Random Forest (accuracy > 99%) come very close. We conclude that tree-based classifiers work well for spam mail detection, and that accuracy improves considerably when a feature selection algorithm is applied first.
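The comparison described above (one classifier evaluated on the full feature set and again on a selected subset) can be illustrated with a small sketch. It is not the study's setup. The data are synthetic, a nearest-centroid classifier stands in for Random Tree, and plain greedy forward selection stands in for Best-First, which can also backtrack. All function names (`make_data`, `centroid_fit`, `forward_select`, and so on) and all parameter values are assumptions made for illustration.

```python
import random

N_FEATURES = 10  # assumed: 2 informative features plus 8 noise features


def make_data(n, seed=0):
    """Synthetic stand-in for a spam dataset (label 1 = spam, 0 = ham)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        label = rng.randint(0, 1)
        # Informative features shift with the label; noise features do not.
        info = [rng.gauss(2.0 * label, 1.0) for _ in range(2)]
        noise = [rng.gauss(0.0, 3.0) for _ in range(N_FEATURES - 2)]
        rows.append((info + noise, label))
    return rows


def centroid_fit(rows, feats):
    """Compute the per-class mean vector over the chosen feature indices."""
    sums = {0: [0.0] * len(feats), 1: [0.0] * len(feats)}
    counts = {0: 0, 1: 0}
    for x, y in rows:
        counts[y] += 1
        for j, f in enumerate(feats):
            sums[y][j] += x[f]
    return {c: [s / max(counts[c], 1) for s in sums[c]] for c in sums}


def centroid_predict(model, x, feats):
    """Return the class whose centroid is closest in squared distance."""
    def dist(c):
        return sum((x[f] - m) ** 2 for f, m in zip(feats, model[c]))
    return min(model, key=dist)


def accuracy(train, test, feats):
    """Fit on train, then report the fraction of test rows classified correctly."""
    model = centroid_fit(train, feats)
    hits = sum(centroid_predict(model, x, feats) == y for x, y in test)
    return hits / len(test)


def forward_select(train, valid):
    """Greedy forward selection: add the feature that most improves
    validation accuracy, and stop when no candidate improves it."""
    selected, best = [], 0.0
    while len(selected) < N_FEATURES:
        candidates = [f for f in range(N_FEATURES) if f not in selected]
        score, feat = max(
            (accuracy(train, valid, selected + [f]), f) for f in candidates
        )
        if score <= best:
            break
        selected.append(feat)
        best = score
    return selected


data = make_data(600)
train, valid, test = data[:300], data[300:450], data[450:]

all_feats = list(range(N_FEATURES))
selected = forward_select(train, valid)
acc_all = accuracy(train, test, all_feats)
acc_sel = accuracy(train, test, selected)

print("selected features:", selected)
print("accuracy, all features:     ", round(acc_all, 3))
print("accuracy, selected features:", round(acc_sel, 3))
```

Selection is scored on a separate validation split so that the final test accuracy is not used to choose features. Whether the reduced subset beats the full set depends on the data and the classifier. The figures this toy prints say nothing about the accuracies reported in the study.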