DGA Intel

This deep learning model uses a CNN-LSTM architecture to predict whether a given domain name is genuine or was artificially generated by a DGA.

The Problem

Many forms of malware use domain generation algorithms (DGAs) to connect with a command-and-control (C&C) server, which enables the malware to receive instructions and perform malicious activities. There have been many attempts to detect whether a given domain name corresponds to a genuine domain or to a fake domain generated by a DGA. Some machine learning methods have used clustering based on WHOIS data and similar features to this end. This model builds on past work by using a deep learning architecture to achieve increased accuracy over other methods.

The Model

This model is based on an architecture from [2] and implemented in TensorFlow [1]. It embeds the characters of a domain name, feeds the embeddings through a convolutional network, feeds the result through an LSTM, and passes the LSTM output through a dense layer for classification. This approach captures the local character patterns inherent in genuine domains as well as longer-range dependencies between characters across the sequence.
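The pipeline described above (embedding, convolution, LSTM, dense classifier) can be sketched with the Keras API roughly as follows. This is a minimal illustration, not the project's actual network: the vocabulary size, maximum length, layer widths, kernel size, and pooling step are all assumptions chosen for the example.

```python
import numpy as np
import tensorflow as tf

# All hyperparameters here are illustrative assumptions,
# not the values used by the trained model in this repository.
VOCAB_SIZE = 40   # assumed number of character indices (0 reserved for padding)
MAX_LEN = 64      # assumed maximum domain length after padding
EMBED_DIM = 32    # assumed embedding width


def build_model():
    """Build an embedding -> Conv1D -> LSTM -> dense binary classifier."""
    inputs = tf.keras.Input(shape=(MAX_LEN,), dtype="int32")
    # Map each character index to a dense vector.
    x = tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM)(inputs)
    # The convolution picks up local character n-gram patterns.
    x = tf.keras.layers.Conv1D(64, kernel_size=3, padding="same",
                               activation="relu")(x)
    x = tf.keras.layers.MaxPooling1D(pool_size=2)(x)
    # The LSTM summarizes the sequence of convolutional features.
    x = tf.keras.layers.LSTM(64)(x)
    # A single sigmoid unit outputs the probability of label 1 (fake).
    outputs = tf.keras.layers.Dense(1, activation="sigmoid")(x)
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model


model = build_model()
# Two all-padding dummy inputs, only to check that the shapes line up.
dummy = np.zeros((2, MAX_LEN), dtype="int32")
probs = model.predict(dummy, verbose=0)
```

The untrained model returns one probability per input domain; real inputs would need the same character-to-index encoding that was used during training.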

The Data

The training data was a set of 1.5 million domain names, each labelled 0 (genuine) or 1 (fake), drawn from the Splunk DGA app, Alexa's top 1 million domains, and the Bambenek DGA feed. Ten percent of the domains were stripped of their TLD and subdomain before being fed to the model. The test data was a set of 100,000 domains from a different slice of this data.
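The stripping step could look something like the sketch below. It is an assumption-laden illustration rather than the preprocessing code used for this dataset: the helper names are hypothetical, and the stripping rule naively treats the last label as the TLD, so multi-label suffixes such as `co.uk` are not handled.

```python
import random


def strip_tld_and_subdomain(domain):
    """Keep only the second-to-last label, e.g. 'en.wikipedia.org' -> 'wikipedia'.

    Naive assumption: the TLD is exactly one label long.
    """
    labels = domain.lower().strip(".").split(".")
    return labels[-2] if len(labels) >= 2 else labels[0]


def strip_fraction(domains, fraction=0.1, seed=0):
    """Strip roughly `fraction` of the domains, leaving the rest untouched."""
    rng = random.Random(seed)  # seeded so the selection is reproducible
    return [
        strip_tld_and_subdomain(d) if rng.random() < fraction else d
        for d in domains
    ]


sample = ["en.wikipedia.org", "google.com", "localhost"]
stripped_all = strip_fraction(sample, fraction=1.0)
stripped_none = strip_fraction(sample, fraction=0.0)
```

Mixing stripped and full names in the training set presumably lets the model handle queries given with or without a TLD.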

Results

The model was trained for twenty epochs with the Adam optimizer. It was tested by evaluating its predictive accuracy on 100,000 domains from the shuffled test dataset. It achieved 98.8% accuracy on the test data.
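For reference, predictive accuracy on a labelled test set can be computed as in the sketch below. The 0.5 decision threshold and the function name are assumptions for illustration; the evaluation code used for the reported figure is not shown here.

```python
def accuracy(probabilities, labels, threshold=0.5):
    """Fraction of domains whose thresholded prediction matches the label.

    `probabilities` are model outputs for label 1 (fake);
    `labels` are the true 0/1 labels. The threshold is an assumed default.
    """
    if len(probabilities) != len(labels):
        raise ValueError("probabilities and labels must have the same length")
    predictions = [1 if p >= threshold else 0 for p in probabilities]
    correct = sum(int(pred == y) for pred, y in zip(predictions, labels))
    return correct / len(labels)


# Toy example: three of the four predictions match their labels.
score = accuracy([0.9, 0.2, 0.6, 0.1], [1, 0, 0, 0])
```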

Website Usage

You can query whether a given domain is genuine or fake through this model at http://dgaintel.com/.

Development

The model can be loaded through TensorFlow's Keras API from the domain_classifier_model.h5 file. To further experiment with the code:

1. Go to Google Colab.
2. Go to File > Open Notebook... > GitHub.
3. Search for https://github.com/sudo-rushil/dga-intel.
4. Open domain_data.ipynb or domain_model.ipynb.
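Loading the saved model locally through the Keras API might look like the sketch below. It assumes domain_classifier_model.h5 sits in the current working directory; the `load_classifier` helper is hypothetical and not part of the repository. The character encoding expected by the model is not documented here, so the sketch stops at inspecting the architecture.

```python
import os

import tensorflow as tf

# Assumed location of the model file; adjust to where the repository is cloned.
MODEL_PATH = "domain_classifier_model.h5"


def load_classifier(path=MODEL_PATH):
    """Load the saved Keras model, failing early if the file is missing."""
    if not os.path.exists(path):
        raise FileNotFoundError(f"Model file not found: {path}")
    return tf.keras.models.load_model(path)


if os.path.exists(MODEL_PATH):
    model = load_classifier()
    model.summary()  # prints the layer stack and parameter counts
```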

Code Usage

$ git clone https://github.com/sudo-rushil/dga-intel
$ cd dga-intel
$ python predict_domain.py [domain name]

Example

$ python predict_domain.py wikipedia.com

The domain wikipedia.com is genuine with probability 1.0

Contact

If you run into any problems, file an issue at https://github.com/sudo-rushil/dga-intel/issues.

My LinkedIn page can be found here.

References

[1] Abadi, et al. "TensorFlow: Large-scale machine learning on heterogeneous systems". 2015. Software available from tensorflow.org.

[2] Yu, Bin; Pan, Jie; Hu, Jiaming; Nascimento, Anderson; De Cock, Martine. "Character Level based Detection of DGA Domain Names". 2018 International Joint Conference on Neural Networks (IJCNN).

