N-Grams & PE features: Malware Static Features

This post is the second in a series analysing the most common static features used in the detection and classification of malware. Here we look at two of them: n-grams and PE file features.

N-Grams

An n-gram is a contiguous subsequence of n items from a given sequence. In the context of malware detection and classification, the term usually refers to byte n-grams or opcode n-grams, where an opcode is the portion of a machine-language instruction that specifies the operation to be performed. The basic procedure is to take a corpus of malicious programs and extract the most common byte n-grams, i.e. the sequences of n bytes that appear most often across all of the malware, or at least across a particular class/family of malware. There are therefore two variables in this approach: (1) the window size n, and (2) the number of such frequent sequences used to profile the class, called the profile length. In [1], Abou-Assaleh et al. take labelled malicious and non-malicious programs, extract the most common byte n-gram features, and use the k-nearest neighbour algorithm to classify new instances as malicious or non-malicious.
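To make the two variables concrete, here is a minimal Python sketch of building a byte n-gram profile and assigning a new sample to the closest class profile. The function names, the toy byte strings and the relative-difference dissimilarity are assumptions made for illustration; the sketch compares against one profile per class rather than running a full k-nearest-neighbour search, and it is not the exact procedure from [1].

```python
from collections import Counter


def byte_ngrams(data: bytes, n: int):
    """Yield every contiguous run of n bytes in data."""
    for i in range(len(data) - n + 1):
        yield data[i:i + n]


def build_profile(samples, n, profile_length):
    """Map the profile_length most frequent n-grams to their relative frequency."""
    counts = Counter()
    for data in samples:
        counts.update(byte_ngrams(data, n))
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.most_common(profile_length)}


def dissimilarity(p, q):
    """Sum of squared relative frequency differences (an illustrative choice)."""
    score = 0.0
    for gram in set(p) | set(q):
        a, b = p.get(gram, 0.0), q.get(gram, 0.0)
        score += ((a - b) / ((a + b) / 2)) ** 2
    return score


def classify(data, class_profiles, n, profile_length):
    """Return the label whose profile is least dissimilar to the sample's profile."""
    profile = build_profile([data], n, profile_length)
    return min(class_profiles,
               key=lambda label: dissimilarity(profile, class_profiles[label]))


# Toy corpora (hypothetical bytes, not real malware).
malicious = [b"\x90\x90\x90\xeb\xfe", b"\x90\x90\xeb\xfe\x90"]
benign = [b"hello world", b"hello there"]
profiles = {
    "malicious": build_profile(malicious, n=2, profile_length=5),
    "benign": build_profile(benign, n=2, profile_length=5),
}
label = classify(b"\x90\x90\x90\x90\xeb\xfe", profiles, n=2, profile_length=5)
```

On real corpora, n and the profile length would be tuned empirically, since both change which sequences end up in the profile.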

In [2], Moskovitch et al. use opcode n-grams instead, since they make more semantic sense than raw bytes. Simple post-processing can also abstract the code a little by discarding the parameters (operands) and considering only the sequence of opcodes. Rare patterns can be used directly as signatures. The reasoning behind this method is that new malware is rarely built from scratch: the generation engine or the polymorphic engine is often copied, which leads to specific sequences that give high detection accuracy. In [3], Santos et al. use a weighted representation of the opcode sequences in malicious and benign programs, the logic being that the rarer an opcode sequence is, the better it is at classifying malware. These methods are not completely helpless against polymorphic or metamorphic malware, since the most frequent opcode sequences returned would be those of the encrypting code, which is itself a highly reused piece of code.
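The opcode pipeline can be sketched in a few lines of Python: strip the operands from a disassembly listing, form opcode n-grams, and weight each n-gram so that rarer sequences count for more. The input format (one "mnemonic operands" string per instruction) and the inverse-document-frequency weight are assumptions made for illustration; they stand in for, and are not, the weighting scheme used in [3].

```python
import math
from collections import Counter


def opcodes_only(disassembly):
    """Keep the mnemonic of each instruction line and drop its operands."""
    return [line.split()[0].lower() for line in disassembly if line.strip()]


def opcode_ngrams(opcodes, n):
    """Return all contiguous opcode sequences of length n as tuples."""
    return [tuple(opcodes[i:i + n]) for i in range(len(opcodes) - n + 1)]


def idf_weights(programs, n):
    """Weight each n-gram by log(N / df): the rarer across programs, the heavier."""
    doc_freq = Counter()
    for ops in programs:
        doc_freq.update(set(opcode_ngrams(ops, n)))
    total = len(programs)
    return {gram: math.log(total / df) for gram, df in doc_freq.items()}


def weighted_vector(opcodes, weights, n):
    """Relative frequency of each n-gram in one program, scaled by its weight."""
    counts = Counter(opcode_ngrams(opcodes, n))
    length = sum(counts.values())
    return {gram: (c / length) * weights.get(gram, 0.0)
            for gram, c in counts.items()}


# Hypothetical listings, already reduced to opcode sequences.
programs = [
    ["push", "mov", "xor", "ret"],
    ["push", "mov", "call", "ret"],
    ["push", "mov", "pop", "ret"],
]
weights = idf_weights(programs, n=2)
vector = weighted_vector(programs[0], weights, n=2)
```

In this toy corpus the pair ("push", "mov") occurs in every program and so receives zero weight, while ("mov", "xor") occurs in only one and dominates that program's vector. The resulting vectors could then be fed to any standard classifier.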

PE File Features

PE (Portable Executable) is a file format for executables, DLLs and other files that contain code to be loaded into memory and executed. It is essentially a data structure that encapsulates the information the OS loader needs to manage the executable code. It contains a number of headers that tell the dynamic linker how to map the file into memory. Another important structure is the Import Address Table (IAT), which lists the modules/libraries that the program uses during execution. In [4], Ye et al. build a parser that walks through the PE structure of a file, extracts the information from the IAT, and extracts strings that carry interpretable semantic meaning. The paper explicitly states that the files used to train the SVM are all unpacked executables. The reason is straightforward: packing hides the original code and imports, and every approach based on static features faces this problem.
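As a rough illustration of what reading these headers involves, the Python sketch below locates the PE signature through the DOS header and unpacks the fixed 20-byte COFF file header that follows it. It also includes a crude printable-string scan as a stand-in for string extraction. It does not resolve the import table, which would require walking the optional header and section table, and it is not the parser from [4]; the function names and the dictionary keys are assumptions.

```python
import re
import struct


def read_coff_header(data: bytes):
    """Parse the DOS stub pointer, the PE signature and the COFF file header."""
    if data[:2] != b"MZ":
        raise ValueError("missing MZ signature")
    # The 4-byte field at offset 0x3C (e_lfanew) points to the PE signature.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\0\0":
        raise ValueError("missing PE signature")
    (machine, number_of_sections, timestamp, _symbol_table, _symbol_count,
     optional_header_size, characteristics) = struct.unpack_from(
        "<HHIIIHH", data, pe_offset + 4)
    return {
        "machine": machine,
        "number_of_sections": number_of_sections,
        "timestamp": timestamp,
        "optional_header_size": optional_header_size,
        "characteristics": characteristics,
    }


def printable_strings(data: bytes, min_length: int = 4):
    """Return runs of at least min_length printable ASCII bytes."""
    pattern = rb"[\x20-\x7e]{%d,}" % min_length
    return [match.decode("ascii") for match in re.findall(pattern, data)]


# A synthetic, hand-built header blob (not a loadable executable).
dos_header = bytearray(0x40)
dos_header[:2] = b"MZ"
struct.pack_into("<I", dos_header, 0x3C, 0x40)
coff = struct.pack("<HHIIIHH", 0x14C, 3, 0, 0, 0, 0xE0, 0x0102)
blob = bytes(dos_header) + b"PE\0\0" + coff + b"\x00\x01kernel32.dll\x00CreateFileA\x00ab\x00"

header = read_coff_header(blob)
strings = printable_strings(blob)
```

On a packed file, a scan like this would mostly surface the unpacking stub's strings rather than those of the original program, which is why training on unpacked executables matters.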

 

[1] Detection of New Malicious Code Using N-grams Signatures by T. Abou-Assaleh, N. Cercone, V. Keselj and R. Sweidan.

[2] Unknown Malcode Detection Using OPCODE Representation by R. Moskovitch, C. Feher, N. Tzachar, E. Berger, M. Gitelman, S. Dolev and Y. Elovici.

[3] Opcode Sequences as Representation of Executables for Data-Mining Based Unknown Malware Detection by I. Santos, F. Brezo, X. Ugarte-Pedrero and P. G. Bringas.

[4] SBMDS: An Interpretable String Based Malware Detection System Using SVM Ensemble with Bagging by Y. Ye, L. Chen, D. Wang, T. Li, Q. Jiang and M. Zhao.
