'Magic' operation - automatically detect and run operations #239

Merged 20 commits on May 20, 2018

@n1474335 (Member) commented Jan 23, 2018

Summary

The 'Magic' operation attempts to automatically detect what format the input data is in and which operations can be used to make more sense of it. It does this through a variety of methods:

  1. Using magic bytes to detect file types.
  2. Using regular expressions to detect data encoded in a specific format (e.g. Base64, hex or URL encoding).
  3. Using byte frequency analysis to detect how closely the data matches various natural languages.

Once a possible encoding has been detected, the 'Magic' operation runs the corresponding decoding operation and then repeats the same detection process on its output. This can continue for several levels, up to the limit set by the 'Depth' argument.
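
The detect, decode and repeat loop can be sketched roughly as follows. This is a minimal illustration and not the code in this PR: the `DETECTORS` table and the `speculate` function are hypothetical names, only two encodings are covered, and the hex pattern is an assumption made for the example.

```javascript
// Hypothetical detector table: each entry pairs a strict pattern with a decoder.
const DETECTORS = [
    {
        name: "From Hex",
        pattern: /^(?:[0-9a-f]{2})+$/i, // assumed pattern, for illustration only
        decode: s => s.replace(/../g, h => String.fromCharCode(parseInt(h, 16))),
    },
    {
        name: "From Base64",
        pattern: /^(?:[A-Z\d+/]{4})+(?:[A-Z\d+/]{2}==|[A-Z\d+/]{3}=)?$/i,
        decode: s => atob(s),
    },
];

// Returns every branch explored: the recipe so far and the output it produces.
function speculate(input, depth, recipe = []) {
    const results = [{ recipe, output: input }];
    if (depth <= 0) return results; // 'Depth' limit reached, stop recursing
    for (const d of DETECTORS) {
        if (!d.pattern.test(input)) continue;
        let decoded;
        try {
            decoded = d.decode(input);
        } catch {
            continue; // decoding failed, so abandon this branch
        }
        results.push(...speculate(decoded, depth - 1, [...recipe, d.name]));
    }
    return results;
}
```

Calling `speculate(btoa("48656c6c6f"), 3)` explores the raw input, then 'From Base64', then 'From Base64' followed by 'From Hex'. Ranking the resulting branches by likelihood is a separate step that this sketch leaves out.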

Examples

screenshot from 2018-01-22 23-54-52
This example shows the 'Magic' operation detecting three levels of encoding. The results are listed in order of likelihood. The first row shows that the three operations 'From Base64', 'Gunzip' and 'From Hex' will result in an output that looks quite likely to be written in English. The second row shows that just running 'From Base64' results in an output that looks like a gzip file. The third row shows that the raw data without any operations applied doesn't look very much like any language, although it is closest to Portuguese.


screenshot from 2018-01-23 00-06-22
This example shows a PNG image which has been URL and Base32 encoded. The 'Magic' operation has correctly detected these encodings and has also discovered that the 'Render Image' operation can be used to further improve the recipe.


screenshot from 2018-01-23 00-41-36
This example shows the 'Magic' operation correctly discovering Hindi text underneath three levels of encoding and compression.

Details

The three detection methods mentioned above are explained here in further detail.

Magic bytes

This detection method was already available in CyberChef in the form of the 'Detect file type' operation. It has been incorporated into this operation to provide additional metadata on which to base decisions.
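
For readers unfamiliar with the technique, magic byte detection compares the first few bytes of the data against known file signatures. The sketch below is only an illustration: `SIGNATURES` and `detectFileType` are hypothetical names, and the real 'Detect file type' operation covers far more formats than the three listed here.

```javascript
// A few well-known file signatures (the leading bytes of each format).
const SIGNATURES = [
    { type: "PNG image", bytes: [0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a] },
    { type: "gzip archive", bytes: [0x1f, 0x8b] },
    { type: "ZIP archive", bytes: [0x50, 0x4b, 0x03, 0x04] },
];

// Returns the first signature whose bytes prefix the data, or null if none do.
function detectFileType(data) {
    const match = SIGNATURES.find(sig =>
        sig.bytes.length <= data.length &&
        sig.bytes.every((b, i) => data[i] === b));
    return match ? match.type : null;
}
```

For example, `detectFileType(Uint8Array.from([0x1f, 0x8b, 0x08, 0x00]))` returns `"gzip archive"`, which is the kind of hint that lets 'Magic' suggest 'Gunzip' as the next operation.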

Regular expressions to detect encodings

Patterns have been added to all relevant operations in the OperationConfig.js file. These patterns specify as strictly as possible what the data should look like if it is to match the operation. For example, the following configuration is used for the 'From Base64' operation:

{
    match: "^(?:[A-Z\\d+/]{4})+(?:[A-Z\\d+/]{2}==|[A-Z\\d+/]{3}=)?$",
    flags: "i",
    args: ["A-Za-z0-9+/=", false]
}

Alternative patterns can be added for use with different arguments. For example, Base64 encoding using the BinHex alphabet is specified like so:

{
    match: "^[!\"#$%&'()*+,\\-0-689@A-NP-VX-Z[`a-fh-mp-r]{20,}$",
    flags: "",
    args: ["!-,-0-689@A-NP-VX-Z[`a-fh-mp-r", false]
}
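
To show how entries like these might be consumed, here is a small sketch that tests an input against each pattern in turn and returns the arguments of the first one that matches. The `PATTERNS` array and the `findMatchingArgs` helper are hypothetical; how the operation actually iterates over OperationConfig.js may differ.

```javascript
// The two pattern entries shown above, collected into one list.
const PATTERNS = [
    {
        match: "^(?:[A-Z\\d+/]{4})+(?:[A-Z\\d+/]{2}==|[A-Z\\d+/]{3}=)?$",
        flags: "i",
        args: ["A-Za-z0-9+/=", false],
    },
    {
        match: "^[!\"#$%&'()*+,\\-0-689@A-NP-VX-Z[`a-fh-mp-r]{20,}$",
        flags: "",
        args: ["!-,-0-689@A-NP-VX-Z[`a-fh-mp-r", false],
    },
];

// Returns the args of the first pattern matching the input, or null.
function findMatchingArgs(input, patterns) {
    for (const p of patterns) {
        if (new RegExp(p.match, p.flags).test(input)) return p.args;
    }
    return null;
}
```

With these entries, `findMatchingArgs("SGVsbG8gd29ybGQ=", PATTERNS)` returns the standard alphabet arguments, while an unpadded string such as `"SGVsbG8"` returns `null`. This reflects how strictly the patterns are written: a looser pattern would produce many more false positives.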

Byte frequency analysis

Using Pearson's Chi-Squared test, we can determine how closely a given set of data matches the byte frequency of a certain language; a lower statistic indicates a closer match. To generate the reference data, I downloaded dumps of Wikipedia in 284 different languages, stripped out the wiki formatting, then measured the frequency of every byte. This gave me a set of data, unique to each language, which shows how common each byte is when the characters are encoded in UTF-8.
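
As a rough illustration of the scoring, the sketch below computes the chi-squared statistic between the observed byte counts of some data and the counts a language profile would predict, then ranks profiles by that score. Everything here is an assumption made for the example: the function names are hypothetical, the profiles are built from tiny toy samples rather than Wikipedia dumps, and the small floor applied to zero frequencies is just one way of avoiding division by zero.

```javascript
// Counts how often each of the 256 possible byte values occurs.
function byteCounts(data) {
    const counts = new Array(256).fill(0);
    for (const b of data) counts[b]++;
    return counts;
}

// Builds a relative-frequency profile (256 values summing to 1) from sample bytes.
function profileFrom(sample) {
    return byteCounts(sample).map(c => c / sample.length);
}

// Pearson's chi-squared statistic: sum over bytes of (observed - expected)^2 / expected.
function chiSquared(data, expectedFreq) {
    const observed = byteCounts(data);
    let score = 0;
    for (let i = 0; i < 256; i++) {
        // Assumption: floor the frequency so bytes unseen in the profile do not divide by zero.
        const expected = Math.max(expectedFreq[i], 1e-6) * data.length;
        score += (observed[i] - expected) ** 2 / expected;
    }
    return score;
}

// Ranks named profiles from closest match (lowest score) to furthest.
function rankLanguages(data, profiles) {
    return Object.entries(profiles)
        .map(([name, freq]) => ({ name, score: chiSquared(data, freq) }))
        .sort((x, y) => x.score - y.score);
}
```

Given `new TextEncoder().encode(text)` as the data and one profile per language, the first entry returned by `rankLanguages` is the closest match.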

Future improvements

  • Show which operations the data matches even if the 'Depth' argument does not allow running them
  • Add entropy calculations for each branch, over both the entire input and a sliding window, to generate a structural map.
  • Run speculative XOR, ROT, Bit shift and Rotate brute forcing. This should be optional as it will drastically increase the running time.
  • Attempt to convert data from various character encodings at each stage - does it match UTF-8?
  • Add support for more languages
  • Allow the user to enter a crib pattern
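
The entropy idea in the list above could look something like the following. This is a speculative sketch of a feature that is not part of this PR: the function names are hypothetical, and it assumes the data is a `Uint8Array` so that `subarray` is available.

```javascript
// Shannon entropy in bits per byte: 0 for constant data, 8 for uniformly random bytes.
function shannonEntropy(data) {
    if (data.length === 0) return 0;
    const counts = new Array(256).fill(0);
    for (const b of data) counts[b]++;
    let h = 0;
    for (const c of counts) {
        if (c === 0) continue;
        const p = c / data.length;
        h -= p * Math.log2(p);
    }
    return h;
}

// Entropy of each window of the input, giving a coarse structural map.
function slidingEntropy(data, windowSize, step = windowSize) {
    const out = [];
    for (let i = 0; i + windowSize <= data.length; i += step) {
        out.push(shannonEntropy(data.subarray(i, i + windowSize)));
    }
    return out;
}
```

High-entropy windows tend to indicate compressed or encrypted regions, while low-entropy windows suggest padding or plain text, so the per-window values could help decide where in the data an operation should be applied.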

@n1474335 (Member, Author)

An example of the new extensive language support and intensive brute-forcing capabilities:

image

@n1474335 added this to the v8.0.0 milestone on May 14, 2018
@n1474335 changed the base branch from master to esm on May 20, 2018 15:50
@n1474335 merged commit ee519c7 into esm on May 20, 2018
@n1474335 deleted the feature-magic branch on November 13, 2018 17:05
This pull request was closed.