Using machine learning models to predict the probability of a windows system getting infected by various families of malware, based on different properties of that system.


MalwareDetection

The main notebook, MalwareDetection_ExploratoryTerritory.ipynb, contains everything from data cleaning and feature engineering to the LightGBM implementation and Kaggle submission, along with detailed instructions.
It is recommended to run the notebook in Google Colab with TPU as the hardware accelerator.
Link to notebook : https://colab.research.google.com/drive/1KkgpJfH5LvAtgoi2_H0Pr7Kjr5PaKab5
Link to video: https://www.youtube.com/watch?v=F_jVU_2fyn0&feature=youtu.be


Data sets

Data source for train and test data: Kaggle link for Microsoft Malware Prediction

  • Each row in this dataset corresponds to a system, uniquely identified by a MachineIdentifier.
  • HasDetections is the target and indicates whether Malware was detected on the system.
  • HasDetections is missing in the test dataset and must be predicted using the train dataset (see the loading sketch below).
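
A minimal loading sketch, assuming the standard Kaggle file layout; the exact paths and any dtype optimizations used in the notebook may differ.

```python
import pandas as pd

# Paths follow the Kaggle layout for the Microsoft Malware Prediction
# competition; adjust them if the files live elsewhere.
TRAIN_PATH = "/kaggle/input/microsoft-malware-prediction/train.csv"
TEST_PATH = "/kaggle/input/microsoft-malware-prediction/test.csv"

train_df = pd.read_csv(TRAIN_PATH)
test_df = pd.read_csv(TEST_PATH)

# Each row is one machine: MachineIdentifier is the unique key and
# HasDetections (0/1) is the target, present only in the training set.
X_train = train_df.drop(columns=["HasDetections"])
y_train = train_df["HasDetections"]
assert "HasDetections" not in test_df.columns
```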

Data source for Antivirus Signature vs Timestamp : Kaggle link
This dataset maps antivirus signature versions ('AvSigVersion') to timestamps. The antivirus signature version is updated approximately every 2 hours, and about 95% of users' antivirus installations update it regularly, which makes it a trustworthy timestamp for each observation. In other words, the antivirus signature version a system had when it was sampled can be mapped to the time at which the system was sampled. The timestamps in this dataset are provided by Microsoft, which derived them in exactly this way, approximating each observation's sampling time from its 'AvSigVersion'.
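
A hedged sketch of applying such a mapping, assuming the dataset ships as a dictionary-like object keyed by 'AvSigVersion'; the file name below is an assumption, not necessarily the actual file in the Kaggle dataset.

```python
import numpy as np
import pandas as pd

# Assumption: the dataset is a saved dictionary mapping each 'AvSigVersion'
# string to the datetime at which that signature version was published.
# The file name is illustrative; use the actual file from the Kaggle dataset.
sig_to_time = np.load("AvSigVersionTimestamps.npy", allow_pickle=True).item()

# Approximate each observation's sampling time from its AvSigVersion.
train_df["SampleTime"] = pd.to_datetime(
    train_df["AvSigVersion"].map(sig_to_time), errors="coerce"
)
```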

Pickled Objects : Kaggle link

  • train_df.pkl : pickle of the training data after preprocessing.
  • test_df.pkl : pickle of the testing data after preprocessing.
  • LGBMModel.pkl : pickle of the trained LGBM model (a loading sketch follows below).
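
A minimal sketch, assuming the data frames were written with DataFrame.to_pickle and the model is an sklearn-style LGBMClassifier; the column names are assumptions about the preprocessed frames.

```python
import pickle
import pandas as pd

# Preprocessed frames (assumed written with DataFrame.to_pickle).
train_df = pd.read_pickle("train_df.pkl")
test_df = pd.read_pickle("test_df.pkl")

# Trained model (assumed to be an sklearn-style LGBMClassifier).
with open("LGBMModel.pkl", "rb") as f:
    lgbm_model = pickle.load(f)

# Predicted probability of HasDetections for each test machine.
probs = lgbm_model.predict_proba(test_df.drop(columns=["MachineIdentifier"]))[:, 1]
```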


Feature Engineering and Dimensionality Reduction

Feature engineering is performed by feature.py

Additionally, the notebook ExploratoryDataAnalysis.ipynb provides data visualization, alongside the contents of feature.py, to justify which features were kept and which were dropped.

To view the notebook on GitHub, use nbviewer, as it renders the Plotly plots properly.

Alternatively, it is also hosted as a private kernel at https://www.kaggle.com/mehulthakral/malware-detection-by-exploratory-territory

Steps to run :

  1. Install all requirements: numpy, matplotlib, pandas and seaborn (all are preinstalled on Kaggle).
  2. Additionally, install chart_studio (from the Kaggle console if running on Kaggle) for plotting with Plotly.
  3. Change the path of train.csv in the "Loading the data" section; use /kaggle/input/microsoft-malware-prediction/train.csv as the path if running on Kaggle.
  4. Change the destination path where new_train.csv must be stored.

The new CSV will have all the unwanted features removed; a rough sketch of these steps is shown below.
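
A rough sketch of what these steps produce; the dropped column names below are illustrative placeholders, not the actual list determined by feature.py.

```python
import pandas as pd

TRAIN_PATH = "/kaggle/input/microsoft-malware-prediction/train.csv"  # step 3
OUT_PATH = "/kaggle/working/new_train.csv"                           # step 4

# Placeholder list: the real set of unwanted features is chosen in
# feature.py / ExploratoryDataAnalysis.ipynb.
UNWANTED_FEATURES = ["PuaMode", "Census_ProcessorClass"]

df = pd.read_csv(TRAIN_PATH)
df = df.drop(columns=[c for c in UNWANTED_FEATURES if c in df.columns])
df.to_csv(OUT_PATH, index=False)
```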

Models Tried

  1. LSTM : Available in LSTM.py. The AUC was approximately 0.55 (a minimal illustrative sketch follows below).
  2. LSTM-CNN : Available in LSTM_CNN.py. Results similar to the LSTM were obtained.
  3. LightGBM : The final model; see the Light Gradient Boosting Machine section below.
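
For reference, a minimal Keras sketch of an LSTM baseline on tabular features; this is only an illustration, not the architecture in LSTM.py.

```python
from tensorflow import keras
from tensorflow.keras import layers

n_features = 40  # placeholder; set to the number of encoded features

# Each machine's numeric feature vector is treated as a length-n_features
# sequence with one value per step, purely for illustration.
model = keras.Sequential([
    layers.Input(shape=(n_features, 1)),
    layers.LSTM(64),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=[keras.metrics.AUC(name="auc")])

# X: (n_samples, n_features) float array, y: 0/1 HasDetections labels.
# model.fit(X.reshape(-1, n_features, 1), y, epochs=3, batch_size=1024)
```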

Observation :

Neural networks are not appropriate for this problem.


Time Series

This kind of malware risk detection is in essence a time series problem, with the sampling date of each data point greatly influencing some of the system's properties. The given dataset is also split into train and test in such a way that the majority of entries in the train data are from August and September 2018, while the test data is mostly from October and November 2018 (as seen in LGBM_EDA.ipynb).
But this dataset poses the following problems for a traditional time series approach:

  • New systems are added to the dataset over time.
  • Some systems occasionally go offline for variable durations of time; no data from these systems is recorded in that period.
  • Systems receive OS patches, bug fixes and OS upgrades over time, thereby changing their properties.

This analysis is intuitive, as newer versions of operating systems and antivirus software are released over time to combat ever-improving malware.

Light Gradient Boosting Machine

Given the shortcomings of a plain time series perspective on the problem, it is best to have a final model that is not strictly a time-series approach to malware prediction, but can accommodate features that are indicative of time. Based on this, two approaches were pursued: LSTMs and gradient-boosted decision trees. To capture the time series aspect of the problem, new features are engineered using the Antivirus Signature vs Timestamp dataset, as sketched below.
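
A hedged sketch of turning the AvSigVersion-derived timestamps into time-indicative features a tree model can use; the specific derived columns are assumptions, not necessarily those engineered in the notebook.

```python
import pandas as pd

# Assumes train_df["SampleTime"] was filled from the AvSigVersion timestamp
# mapping shown earlier. The derived columns below are illustrative.
REF_DATE = pd.Timestamp("2018-07-01")  # arbitrary reference date

train_df["SampleMonth"] = train_df["SampleTime"].dt.month
train_df["DaysSinceRef"] = (train_df["SampleTime"] - REF_DATE).dt.days
```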

Implementation of LGBM model along with detailed comments is in MalwareDetection_ExploratoryTerritory.ipynb

The LGBM model gives an AUC of 0.67, which is significantly better than the roughly 0.50 to 0.55 obtained with the LSTM models.
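
A hedged training sketch using lightgbm's sklearn API; the hyperparameters are illustrative (the tuned setup lives in the notebook), and it assumes categorical columns have already been label encoded.

```python
import lightgbm as lgb
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Assumes all remaining columns are numeric / label encoded.
X = train_df.drop(columns=["MachineIdentifier", "HasDetections"])
y = train_df["HasDetections"]
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Illustrative hyperparameters only.
model = lgb.LGBMClassifier(
    objective="binary",
    n_estimators=1000,
    learning_rate=0.05,
    num_leaves=64,
)
model.fit(X_tr, y_tr, eval_set=[(X_val, y_val)], eval_metric="auc")

val_auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
print(f"Validation AUC: {val_auc:.3f}")
```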
