Using machine learning models to predict the probability of a windows system getting infected by various families of malware, based on different properties of that system.


MalwareDetection

The main notebook, MalwareDetection_ExploratoryTerritory.ipynb, contains everything from data cleaning and feature engineering to the LightGBM implementation and Kaggle submission, along with detailed instructions.
It is recommended to run the notebook in Google Colab with TPU as the hardware accelerator.
Link to notebook : https://colab.research.google.com/drive/1KkgpJfH5LvAtgoi2_H0Pr7Kjr5PaKab5
Link to video: https://www.youtube.com/watch?v=F_jVU_2fyn0&feature=youtu.be


Data sets

Data source for train and test data: Kaggle link for Microsoft Malware Prediction

  • Each row in this dataset corresponds to a system, uniquely identified by a MachineIdentifier.
  • HasDetections is the target and indicates whether Malware was detected on the system.
  • HasDetections is missing in the test dataset and must be predicted using the train dataset (see the loading sketch below).
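
A minimal loading sketch, assuming the standard Kaggle file layout; the exact paths and any dtype optimizations used in the notebook may differ.

```python
import pandas as pd

# Paths follow the Kaggle layout for the Microsoft Malware Prediction
# competition; adjust them if the files live elsewhere.
TRAIN_PATH = "/kaggle/input/microsoft-malware-prediction/train.csv"
TEST_PATH = "/kaggle/input/microsoft-malware-prediction/test.csv"

train_df = pd.read_csv(TRAIN_PATH)
test_df = pd.read_csv(TEST_PATH)

# Each row is one machine: MachineIdentifier is the unique key and
# HasDetections (0/1) is the target, present only in the training set.
X_train = train_df.drop(columns=["HasDetections"])
y_train = train_df["HasDetections"]
assert "HasDetections" not in test_df.columns
```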

Data source for Antivirus Signature vs Timestamp : Kaggle link
This dataset maps antivirus signature versions ('AvSigVersion') to timestamps. The antivirus signature version is updated approximately every 2 hours, and about 95% of users' antivirus installations update it regularly, which makes it a trustworthy timestamp for each observation. In other words, the antivirus signature version a system had when it was sampled can be mapped to the time at which the system was sampled. The timestamps in this dataset are provided by Microsoft, which derived them in exactly this way, approximating each observation's sampling time from its 'AvSigVersion'.
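
A hedged sketch of applying such a mapping, assuming the dataset ships as a dictionary-like object keyed by 'AvSigVersion'; the file name below is an assumption, not necessarily the actual file in the Kaggle dataset.

```python
import numpy as np
import pandas as pd

# Assumption: the dataset is a saved dictionary mapping each 'AvSigVersion'
# string to the datetime at which that signature version was published.
# The file name is illustrative; use the actual file from the Kaggle dataset.
sig_to_time = np.load("AvSigVersionTimestamps.npy", allow_pickle=True).item()

# Approximate each observation's sampling time from its AvSigVersion.
train_df["SampleTime"] = pd.to_datetime(
    train_df["AvSigVersion"].map(sig_to_time), errors="coerce"
)
```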

Pickled Objects : Kaggle link

  • train_df.pkl : pickle of the training data after preprocessing.
  • test_df.pkl : pickle of the testing data after preprocessing.
  • LGBMModel.pkl : pickle of the trained LGBM model (a loading sketch follows below).
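
A minimal sketch, assuming the data frames were written with DataFrame.to_pickle and the model is an sklearn-style LGBMClassifier; the column names are assumptions about the preprocessed frames.

```python
import pickle
import pandas as pd

# Preprocessed frames (assumed written with DataFrame.to_pickle).
train_df = pd.read_pickle("train_df.pkl")
test_df = pd.read_pickle("test_df.pkl")

# Trained model (assumed to be an sklearn-style LGBMClassifier).
with open("LGBMModel.pkl", "rb") as f:
    lgbm_model = pickle.load(f)

# Predicted probability of HasDetections for each test machine.
probs = lgbm_model.predict_proba(test_df.drop(columns=["MachineIdentifier"]))[:, 1]
```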


Feature Engineering and Dimensionality Reduction

Feature engineering is performed by feature.py

Additionally, the notebook ExploratoryDataAnalysis.ipynb provides data visualization, alongside the contents of feature.py, to justify which features were kept and which were dropped.

To view the notebook on GitHub, use nbviewer, as it renders the Plotly plots properly.

Alternatively, it is also hosted as a private kernel at https://www.kaggle.com/mehulthakral/malware-detection-by-exploratory-territory

Steps to run :

  1. Install all requirements: numpy, matplotlib, pandas and seaborn (all are preinstalled on Kaggle).
  2. Additionally, install chart_studio (from the Kaggle console if running on Kaggle) for plotting with Plotly.
  3. Change the path of train.csv in the "Loading the data" section; use /kaggle/input/microsoft-malware-prediction/train.csv as the path if running on Kaggle.
  4. Change the destination path where new_train.csv must be stored.

The new CSV will have all the unwanted features removed; a rough sketch of these steps is shown below.
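
A rough sketch of what these steps produce; the dropped column names below are illustrative placeholders, not the actual list determined by feature.py.

```python
import pandas as pd

TRAIN_PATH = "/kaggle/input/microsoft-malware-prediction/train.csv"  # step 3
OUT_PATH = "/kaggle/working/new_train.csv"                           # step 4

# Placeholder list: the real set of unwanted features is chosen in
# feature.py / ExploratoryDataAnalysis.ipynb.
UNWANTED_FEATURES = ["PuaMode", "Census_ProcessorClass"]

df = pd.read_csv(TRAIN_PATH)
df = df.drop(columns=[c for c in UNWANTED_FEATURES if c in df.columns])
df.to_csv(OUT_PATH, index=False)
```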

Models Tried

  1. LSTM : Available in LSTM.py. The AUC was approximately 0.55 (a minimal illustrative sketch follows below).
  2. LSTM-CNN : Available in LSTM_CNN.py. Results similar to the LSTM were obtained.
  3. LightGBM : The final model; see the Light Gradient Boosting Machine section below.
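
For reference, a minimal Keras sketch of an LSTM baseline on tabular features; this is only an illustration, not the architecture in LSTM.py.

```python
from tensorflow import keras
from tensorflow.keras import layers

n_features = 40  # placeholder; set to the number of encoded features

# Each machine's numeric feature vector is treated as a length-n_features
# sequence with one value per step, purely for illustration.
model = keras.Sequential([
    layers.Input(shape=(n_features, 1)),
    layers.LSTM(64),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=[keras.metrics.AUC(name="auc")])

# X: (n_samples, n_features) float array, y: 0/1 HasDetections labels.
# model.fit(X.reshape(-1, n_features, 1), y, epochs=3, batch_size=1024)
```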

Observation :

Neural networks are not appropriate for this problem.


Time Series

This kind of malware risk detection is in essence a time series problem, with the sampling date of each data point greatly influencing some of the system's properties. The given dataset is also split into train and test in such a way that the majority of entries in the train data are from August and September 2018, while the test data is mostly from October and November 2018 (as seen in LGBM_EDA.ipynb).
But this dataset poses the following problems for a traditional time series approach:

  • New systems are added to the dataset over time.
  • Some systems occasionally go offline for variable durations of time; no data from these systems is recorded in that period.
  • Systems receive OS patches, bug fixes and OS upgrades over time, thereby changing their properties.

This analysis is intuitive, as newer versions of operating systems and antivirus software are released over time to combat ever-improving malware.

Light Gradient Boosting Machine

Given the shortcomings of a plain time series perspective on the problem, it is best to have a final model that is not strictly a time-series approach to malware prediction, but can accommodate features that are indicative of time. Based on this, two approaches were pursued: LSTMs and gradient-boosted decision trees. To capture the time series aspect of the problem, new features are engineered using the Antivirus Signature vs Timestamp dataset, as sketched below.
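
A hedged sketch of turning the AvSigVersion-derived timestamps into time-indicative features a tree model can use; the specific derived columns are assumptions, not necessarily those engineered in the notebook.

```python
import pandas as pd

# Assumes train_df["SampleTime"] was filled from the AvSigVersion timestamp
# mapping shown earlier. The derived columns below are illustrative.
REF_DATE = pd.Timestamp("2018-07-01")  # arbitrary reference date

train_df["SampleMonth"] = train_df["SampleTime"].dt.month
train_df["DaysSinceRef"] = (train_df["SampleTime"] - REF_DATE).dt.days
```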

Implementation of LGBM model along with detailed comments is in MalwareDetection_ExploratoryTerritory.ipynb

The LGBM model gives an AUC of 0.67, which is significantly better than the roughly 0.50 to 0.55 obtained with the LSTM models.
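
A hedged training sketch using lightgbm's sklearn API; the hyperparameters are illustrative (the tuned setup lives in the notebook), and it assumes categorical columns have already been label encoded.

```python
import lightgbm as lgb
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Assumes all remaining columns are numeric / label encoded.
X = train_df.drop(columns=["MachineIdentifier", "HasDetections"])
y = train_df["HasDetections"]
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Illustrative hyperparameters only.
model = lgb.LGBMClassifier(
    objective="binary",
    n_estimators=1000,
    learning_rate=0.05,
    num_leaves=64,
)
model.fit(X_tr, y_tr, eval_set=[(X_val, y_val)], eval_metric="auc")

val_auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
print(f"Validation AUC: {val_auc:.3f}")
```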
