Outline

  • Abstract
  • 1. Introduction
  • 2. Background
  • 3. Methods
  • 4. Evaluation
  • 5. Experiments and Results
  • 6. Discussion and Conclusions
  • Notes
  • Supplementary Material
  • References

Table of contents

  • Abstract
  • 1. Introduction
  • 2. Background
  • 2.1. Detecting unknown malware using byte n-gram patterns
  • 2.2. Representing executable files using OpCodes
  • 2.3. The imbalance problem
  • 3. Methods
  • 3.2. Dataset creation
  • 3.3. Data preparation and feature selection
  • 4. Evaluation
  • 5. Experiments and results
  • 5.1. Experiment
  • 5.1.1. Feature representation vs. n-grams
  • 5.1.2. Feature selection and top selections
  • 5.1.3. Classifiers
  • 5.1.4. Varying OpCode n-gram sizes
  • 5.2. Experiment
  • 5.3. Experiment
  • 6. Discussion and conclusions

Abstract

Previous studies have successfully employed classification algorithms to detect unknown malicious code. Most of these studies represented the inspected files with features extracted from byte n-gram patterns. In this study we represent the inspected files using OpCode n-gram patterns, which are extracted from the files after disassembly and used as features for the classification process. The main goal of the classification process is to detect unknown malware within a set of suspected files; the detected malware will later be included in antivirus software as signatures. A rigorous evaluation was performed using a test collection comprising more than 30,000 files, in which various settings of OpCode n-gram patterns, covering different n-gram sizes and representations, and eight types of classifiers were evaluated. A typical problem in this domain is the imbalance problem, in which the distribution of the classes in real life varies. We investigated the imbalance problem with reference to several real-life scenarios in which malicious files are expected to make up about 10% of the total inspected files. Lastly, we present a chronological evaluation that examines how frequently the training set needs to be updated. The results indicate that the evaluated methodology achieves an accuracy above 96% (with TPR above 0.95 and FPR approximately 0.1), a slight improvement over the results of previous studies that use byte n-gram representation. The chronological evaluation showed a clear trend: performance improves as the training set becomes more up to date.
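To make the feature representation concrete, the following minimal sketch shows how OpCode n-grams might be extracted from a disassembled file and turned into per-file frequencies. It is an illustration only, not the authors' implementation: the function names `opcode_ngrams` and `term_frequencies`, the sample opcode sequence, and the choice of a simple normalized frequency are all assumptions made for this example, and the disassembly step is assumed to have already produced a list of opcode mnemonics.

```python
from collections import Counter


def opcode_ngrams(opcodes, n):
    """Return every contiguous n-gram in a sequence of opcode mnemonics.

    `opcodes` is assumed to be the output of a disassembler with operands
    already stripped (hypothetical input format for this sketch).
    """
    return [tuple(opcodes[i:i + n]) for i in range(len(opcodes) - n + 1)]


def term_frequencies(opcodes, n):
    """Map each n-gram to its share of all n-grams found in one file.

    Normalizing by the total count is an assumption of this sketch; other
    weightings could be substituted here.
    """
    grams = opcode_ngrams(opcodes, n)
    if not grams:
        return {}
    counts = Counter(grams)
    total = len(grams)
    return {gram: count / total for gram, count in counts.items()}


# Hypothetical opcode sequence standing in for one disassembled file.
sample = ["push", "mov", "call", "push", "mov", "ret"]
bigrams = opcode_ngrams(sample, 2)
tf = term_frequencies(sample, 2)
```

Each file then becomes a vector indexed by the n-grams selected as features, which is the form a classifier can consume. Larger values of `n` capture longer instruction patterns at the cost of a much larger and sparser vocabulary.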

