Outline
- Abstract
- 1. Introduction
- 2. Background
- 3. Methods
- 4. Evaluation
- 5. Experiments and Results
- 6. Discussion and Conclusions
- Notes
- Supplementary Material
- References
Table of Contents
- Abstract
- 1. Introduction
- 2. Background
- 2.1. Detecting Unknown Malware Using Byte N-Gram Patterns
- 2.2. Representing Executables Using OpCodes
- 2.3. The Imbalance Problem
- 3. Methods
- 3.2. Dataset Creation
- 3.3. Data Preparation and Feature Selection
- 4. Evaluation
- 5. Experiments and Results
- 5.1. Experiment
- 5.1.1. Feature Representation vs. N-Grams
- 5.1.2. Feature Selection and Top Selections
- 5.1.3. Classifiers
- 5.1.4. Varying OpCode N-Gram Sizes
- 5.2. Experiment
- 5.3. Experiment
- 6. Discussion and Conclusions
Abstract
In previous studies, classification algorithms were successfully employed to detect unknown malicious code. Most of these studies extracted features based on byte n-gram patterns to represent the inspected files. In this study, we represent the inspected files using OpCode n-gram patterns, which are extracted from the files after disassembly. The OpCode n-gram patterns are used as features for the classification process. The main goal of the classification process is to detect unknown malware within a set of suspected files; the detected malware will later be included in antivirus software as signatures. A rigorous evaluation was performed using a test collection comprising more than 30,000 files, in which several OpCode n-gram sizes and representation settings, along with eight types of classifiers, were evaluated. A typical problem in this domain is the imbalance problem, in which the distribution of the classes in real life varies. We investigated the imbalance problem with reference to several real-life scenarios in which malicious files are expected to make up about 10% of the total inspected files. Lastly, we present a chronological evaluation that assesses how frequently the training set needs to be updated. The evaluation results indicate that the methodology achieves an accuracy higher than 96% (with TPR above 0.95 and FPR approximately 0.1), which slightly improves on the results of previous studies that use a byte n-gram representation. The chronological evaluation showed a clear trend in which performance improves as the training set is kept more up to date.
Keywords: Classification - Data Mining - Malicious Code Detection - OpCode
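To make the OpCode n-gram representation mentioned in the abstract concrete, the following is a minimal sketch of turning a disassembled OpCode sequence into n-gram features. It is not the authors' implementation: the function names `opcode_ngrams` and `ngram_frequencies` are hypothetical, the input is assumed to be a list of mnemonic strings already produced by a disassembler, and normalizing counts by the total number of n-grams is an illustrative choice rather than the weighting used in the study.

```python
from collections import Counter
from typing import Dict, List, Tuple

NGram = Tuple[str, ...]


def opcode_ngrams(opcodes: List[str], n: int) -> List[NGram]:
    """Return every contiguous run of n OpCodes, in order of appearance."""
    if n <= 0:
        raise ValueError("n must be positive")
    # A sequence shorter than n yields an empty list.
    return [tuple(opcodes[i:i + n]) for i in range(len(opcodes) - n + 1)]


def ngram_frequencies(opcodes: List[str], n: int) -> Dict[NGram, float]:
    """Map each distinct n-gram to its share of all n-grams in the file.

    Assumption: relative frequency is used only for illustration; the
    abstract does not specify a particular weighting scheme.
    """
    counts = Counter(opcode_ngrams(opcodes, n))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gram: count / total for gram, count in counts.items()}


if __name__ == "__main__":
    # Hypothetical OpCode sequence standing in for disassembler output.
    sequence = ["push", "mov", "push", "mov", "call"]
    for gram, freq in ngram_frequencies(sequence, 2).items():
        print(" ".join(gram), freq)
```

Each file's frequency dictionary can then serve as a sparse feature vector for a classifier, with the n-gram size `n` varied as the abstract describes.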