Future posts with "Malware Features" in the title will explain individual features that are used to classify or identify malware with a machine learning approach. We begin the series with static features, and the first one we look at is the hash value. The point of a hashing function is to produce a fixed-length value from a given input that is, for practical purposes, unique to that input. The input in this case is a binary executable.
Common hashing algorithms you may have come across include MD5 and SHA1. A cryptographic hash function is expected to satisfy the following properties:
- Pre-image resistance: It must be computationally infeasible to reconstruct the input data from the hash value.
- Second pre-image resistance: Given an input, it must be computationally infeasible to find a different input with the same hash value. In other words, the input cannot be modified, even by a single bit, without changing the hash value.
- Collision resistance: It must be computationally infeasible to find two different inputs that return the same hash value.
Because cryptographic hashes have these properties, they are widely used in information security, for example in digital signatures, in message authentication codes, and to make sure files have not been modified during transfer. If the original, unmodified version of a file gives a specific hash value, then in theory anyone who receives the file, by whatever means, only has to check the hash value to confirm that they have received the copy the author intended. Let’s look at an example using the MD5 hash function on an image:
We calculate the hash value of this image, then open the image in a hex editor and change just 1 byte of data. The hash value is then calculated for the modified file. Let’s look at the results and then discuss what they imply:
As we can see, changing a single byte in the image, which does not even produce a difference a human would notice, yields a hash value that bears no inferable similarity to the original. This is a problem, because if the malware simply renames its variables, a fingerprint based on the hash value is rendered useless. To counter this problem, the concept of fuzzy hashing was introduced.
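The experiment described above can be reproduced in a few lines of Python. This is only a minimal sketch: the byte string `original` is a made-up stand-in for the image file, and the offset of the modified byte is arbitrary.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Return the MD5 digest of data as a hex string."""
    return hashlib.md5(data).hexdigest()


# Hypothetical stand-in for the contents of the image file.
original = bytes(range(256)) * 16

# Change one byte, as one would in a hex editor (here: flip its lowest bit).
tampered = bytearray(original)
tampered[100] ^= 0x01
modified = bytes(tampered)

h1 = md5_hex(original)
h2 = md5_hex(modified)

# Count the positions at which the two hex digests happen to agree.
matching = sum(a == b for a, b in zip(h1, h2))

print(h1)
print(h2)
print(f"{matching} of {len(h1)} hex digits match")
```

Any digits that do match are coincidental, which is why a plain hash cannot tell a "nearly identical" file from a completely different one.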
Fuzzy hashing produces hash values that can be compared with each other to express the similarity of two files as a percentage. This handles the previous case, in which labels or similar small details are changed in the code. The following image shows the fuzzy hashes of the two images, computed with a command line tool called ssdeep.
As is apparent, the only difference between the two fuzzy hashes is the character after HqqH. This is achieved by breaking the input into pieces and calculating a hash for each piece.
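To illustrate the piece-wise idea, here is a deliberately simplified sketch. It is not ssdeep's algorithm: ssdeep chooses piece boundaries from the content itself, whereas this toy version uses fixed-size pieces, so it only copes with bytes being changed in place, not inserted or removed. The names `BLOCK_SIZE`, `piecewise_signature` and `similarity` are made up for this example.

```python
import hashlib

BLOCK_SIZE = 64  # hypothetical piece size, chosen only for illustration


def piecewise_signature(data: bytes, block_size: int = BLOCK_SIZE) -> list:
    """Split data into fixed-size pieces and hash each piece separately."""
    return [
        hashlib.md5(data[i:i + block_size]).hexdigest()
        for i in range(0, len(data), block_size)
    ]


def similarity(sig_a: list, sig_b: list) -> float:
    """Percentage of piece positions whose hashes agree."""
    if not sig_a and not sig_b:
        return 100.0
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return 100.0 * matches / max(len(sig_a), len(sig_b))


# Hypothetical stand-in for a file: 1024 bytes, i.e. 16 pieces of 64 bytes.
original = bytes(range(256)) * 4

# Change a single byte in place.
tampered = bytearray(original)
tampered[100] ^= 0x01
modified = bytes(tampered)

sig_a = piecewise_signature(original)
sig_b = piecewise_signature(modified)
score = similarity(sig_a, sig_b)
print(f"{score:.2f}% similar")
```

Only the piece containing the changed byte gets a different hash, so the two signatures still agree in every other position.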
Why this feature is not enough
Fuzzy hashing helps only if the changes in the binary code are minute. If the malware author uses an encryption engine to encrypt the files with different keys and/or additional plaintext elements (polymorphic or metamorphic malware), this approach is effectively defeated, since the encrypted binaries are not expected to be the same or even closely related. The approach can still be used for in-memory hashes, though: no matter how the malware is encrypted and stored on disk, it has to be decrypted in main memory before it executes. The obvious downside is the lag induced by such real-time monitoring systems.
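The effect of encryption can be sketched as follows. The cipher here is a toy XOR keystream built from SHA-256, used only as a stand-in for whatever encryption engine a malware author might use; `keystream_xor`, `shared_pieces`, the keys and the payload are all hypothetical.

```python
import hashlib


def keystream_xor(data: bytes, key: bytes) -> bytes:
    """Toy cipher: XOR data with SHA-256(key + offset) blocks.

    Applying it twice with the same key restores the original data.
    """
    out = bytearray()
    for offset in range(0, len(data), 32):
        pad = hashlib.sha256(key + offset.to_bytes(8, "big")).digest()
        chunk = data[offset:offset + 32]
        out.extend(b ^ p for b, p in zip(chunk, pad))
    return bytes(out)


def shared_pieces(a: bytes, b: bytes, block_size: int = 64) -> int:
    """Count fixed-size pieces that are identical at the same position."""
    count = 0
    for i in range(0, min(len(a), len(b)), block_size):
        if a[i:i + block_size] == b[i:i + block_size]:
            count += 1
    return count


# The same (made-up) payload, stored on disk under two different keys.
payload = b"same malicious payload " * 50
on_disk_a = keystream_xor(payload, b"key-one")
on_disk_b = keystream_xor(payload, b"key-two")

# On disk, the two samples have no pieces in common.
disk_shared = shared_pieces(on_disk_a, on_disk_b)

# In memory, after each sample decrypts itself, the payloads match again.
in_memory_a = keystream_xor(on_disk_a, b"key-one")
in_memory_b = keystream_xor(on_disk_b, b"key-two")
memory_shared = shared_pieces(in_memory_a, in_memory_b)

print("pieces shared on disk:  ", disk_shared)
print("pieces shared in memory:", memory_shared)
```

With nothing in common on disk, a piece-wise comparison has nothing to match, while the decrypted copies in memory are byte-for-byte identical.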