(580th place - Top 24%) Repository for the "Microsoft Malware Prediction" Kaggle competition.

dimitreOliveira/MicrosoftMalwarePrediction

About the repository

The goal of this repository is to use the data from the Kaggle "Microsoft Malware Prediction" competition and apply data science techniques to predict whether a machine will be hit by malware. The biggest challenges in this competition are the size of the dataset, which makes it hard to run on a Kaggle kernel, Google Colab, or a local machine without memory issues, and the high number of features.
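One common way to ease the memory issues mentioned above is to shrink column dtypes after loading. The snippet below is a minimal sketch of that idea, not the code used in this repository: `reduce_memory` and its `max_categories` threshold are hypothetical names chosen for illustration, and it assumes the data is loaded with pandas.

```python
import pandas as pd


def reduce_memory(df: pd.DataFrame, max_categories: int = 1000) -> pd.DataFrame:
    """Return a copy of df with smaller dtypes (hypothetical helper)."""
    out = df.copy()
    for col in out.columns:
        series = out[col]
        if pd.api.types.is_integer_dtype(series):
            # Pick the smallest integer type that can hold the values.
            out[col] = pd.to_numeric(series, downcast="integer")
        elif pd.api.types.is_float_dtype(series):
            # May downcast to float32, which can cost some precision.
            out[col] = pd.to_numeric(series, downcast="float")
        elif (
            pd.api.types.is_object_dtype(series)
            or pd.api.types.is_string_dtype(series)
        ) and series.nunique() <= max_categories:
            # Low-cardinality text columns are stored as integer codes.
            out[col] = series.astype("category")
    return out
```

Calling `reduce_memory(df)` on a chunk read with `pd.read_csv` and comparing `df.memory_usage(deep=True).sum()` before and after shows how much was saved. The gain depends on the data, so it is worth checking on a sample first.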

Our team published Kaggle kernels:

What you will find

  • Deep_learning [link]
    • TensorFlow model (Estimator API) [link]
    • End-to-end model with TensorFlow [link]
  • Documentation [link]
    • Project working cycle and effort, relevant content and insights [link]
  • EDA [link]
    • Analysis of Train Dataset Distribution [link]
    • Analysis of the Distribution Between Test and Train [link]
    • Encoding evaluation with datasets distribution [link]
    • Encoding for binary features [link]
    • Encoding for features with high cardinality [link]
    • Encoding for features with low cardinality [link]
    • Malware Detection - EDA and LGBM [link]
    • Malware Detection - Extended EDA [link]
    • Feature type and cardinality [link]
    • Missing study (high) [link]
    • Missing study (low) [link]
    • Version and Build features [link]
    • Binary features EDA [link]
    • Numerical features EDA [link]
    • Train validation split [link]
  • Model backlog [link]
    • Models generated on Google colab [link]
    • Models generated on Kaggle [link]
    • Model backlog [link]
  • Utils [link]
    • Auxiliary script to merge data sets [link]

Microsoft Malware Prediction

Can you predict if a machine will soon be hit with malware?

Kaggle competition: https://www.kaggle.com/c/microsoft-malware-prediction

Our insights are here.

Overview

The malware industry continues to be a well-organized, well-funded market dedicated to evading traditional security measures. Once a computer is infected by malware, criminals can hurt consumers and enterprises in many ways.

With more than one billion enterprise and consumer customers, Microsoft takes this problem very seriously and is deeply invested in improving security.

As one part of their overall strategy for doing so, Microsoft is challenging the data science community to develop techniques to predict if a machine will soon be hit with malware. As with their previous Malware Challenge (2015), Microsoft is providing Kagglers with an unprecedented malware dataset to encourage open-source progress on effective techniques for predicting malware occurrences.

Can you help protect more than one billion machines from damage BEFORE it happens?

Acknowledgments

This competition is hosted by Microsoft, Windows Defender ATP Research, Northeastern University College of Computer and Information Science, and Georgia Tech Institute for Information Security & Privacy.

Dependencies by folder:

TODO (in case anyone wants to continue this work):

  • The TensorFlow end-to-end part was stopped because it took too much effort; I was able to iterate faster with Keras, but it may be a good option for longer training runs with neural networks.
  • Try model stacking; it may improve predictions but will need a different training and validation setup for the models.
  • Features like AvSigVersion and OSBuild can be mapped to datetimes, as other competitors pointed out, to build a timeline; this may help with train/validation splits.
  • Try dimensionality reduction on high-cardinality features; techniques like PCA, t-SNE, or autoencoders may help.
  • Other approaches to parameter tuning, like Bayesian optimization, may help as well.
  • Model explainability techniques like SHAP or ELI5 may help to fine-tune models.
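For anyone picking up the timeline idea from the list above, here is a rough sketch of a time-ordered train/validation split. It is an illustration under assumptions, not something implemented in this repository: `time_ordered_split` is a hypothetical helper, and `version_dates` is a hypothetical mapping from version string to release date that has to come from an external source, since the competition data does not include one.

```python
import pandas as pd


def time_ordered_split(df, version_col, version_dates, valid_fraction=0.2):
    """Split df so the validation rows are the most recent ones.

    version_dates is a hypothetical {version string: date} mapping.
    Rows whose version is missing from the mapping are dropped.
    """
    dates = pd.to_datetime(df[version_col].map(version_dates))
    mapped = dates.notna()
    known = df[mapped].assign(_date=dates[mapped])
    # Stable sort keeps the original row order among equal dates.
    known = known.sort_values("_date", kind="stable")
    n_valid = int(round(len(known) * valid_fraction))
    cutoff = len(known) - n_valid
    train = known.iloc[:cutoff].drop(columns="_date")
    valid = known.iloc[cutoff:].drop(columns="_date")
    return train, valid
```

A call such as `time_ordered_split(train_df, "AvSigVersion", version_dates)` keeps the newest 20% of mapped rows for validation. Because the cut is made by row position, rows sharing the cutoff date can land on both sides, so a date-based cutoff may be preferable if that matters.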