This repository contains the implementation of a deep learning model for real-life violence detection using the Video Vision Transformer (ViViT) architecture for video classification. The model is trained on the Real Life Violence Situations Dataset hosted on Kaggle.
- `notebooks/`: Jupyter notebooks for data exploration, model training, and evaluation.
- `src/`: Source code for the project.
  - `base/`: Abstract base classes for the model and trainer.
  - `data_loader/`: Data preprocessing and loading scripts (see the sketch after this list).
  - `models/`: Implementation of the Vision Transformer model.
  - `trainers/`: Trainer class for the custom training loop.
- `datasets/`: Placeholder for the Real Life Violence Situations Dataset (not included in this repository).
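The loader itself lives in `src/data_loader/`; as a rough, hypothetical illustration of what such a loader does (not the repository's actual implementation), a clip can be sampled into a fixed-length frame tensor with OpenCV. The function name and the `num_frames`/`frame_size` defaults below are assumptions made for this sketch:

```python
import cv2
import numpy as np

def sample_frames(video_path, num_frames=32, frame_size=(128, 128)):
    """Read a video and return `num_frames` evenly spaced RGB frames as a float32 array."""
    capture = cv2.VideoCapture(video_path)
    total = int(capture.get(cv2.CAP_PROP_FRAME_COUNT))
    # Evenly spaced indices give every clip the same temporal length regardless of duration.
    indices = np.linspace(0, max(total - 1, 0), num_frames).astype(int)

    frames = []
    for idx in indices:
        capture.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = capture.read()
        if not ok:
            break
        frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)   # OpenCV decodes frames as BGR
        frames.append(cv2.resize(frame, frame_size))
    capture.release()

    # If decoding stopped early, repeat the last frame so the tensor shape stays fixed.
    while frames and len(frames) < num_frames:
        frames.append(frames[-1])
    return np.asarray(frames, dtype=np.float32) / 255.0  # (num_frames, H, W, 3)
```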
- Python 3.8+
- TensorFlow 2.13+
- Other dependencies specified in `requirements.txt`
- Clone the repository:

      git clone https://github.com/your-username/real-life-violence-detection.git
      cd real-life-violence-detection

- Install dependencies:

      pip install -r requirements.txt

- Download the Real Life Violence Situations Dataset from Kaggle and place it in the `datasets/` directory.
- Run the Jupyter notebooks in the `notebooks/` directory for data exploration.
- To train the model, execute:

      python src/train.py

- Evaluate the trained model:

      python src/evaluate.py
The Vision Transformer-based model for real-life violence detection, trained on Kaggle's P100 GPU, showed promising performance across multiple metrics. The mean accuracy across 30 epochs reached 85%, with a standard deviation of 2%. Precision and recall for violence detection were consistent, averaging 0.88 and 0.86, respectively.
| Accuracy Curve | Loss Curve |
|---|---|
The Vision Transformer (ViT) architecture, introduced by Alexey Dosovitskiy and his colleagues at Google Research, is a novel approach to computer vision tasks, particularly image classification. Unlike traditional Convolutional Neural Networks (CNNs), which have long been dominant in image processing, ViT uses a transformer architecture originally designed for natural language processing. Below is a detailed explanation of the architecture and how it extends to video:
- Instead of processing individual images, the ViT for videos would take sequences of video frames as input.
- Video frames are divided into fixed-size non-overlapping patches similar to the original ViT for images.
- Each patch in the sequence represents a frame in the video, and the entire sequence forms a temporal representation.
- Tokens are created for each patch, and the sequence of these tokens represents the temporal evolution of the video.
- To capture both spatial and temporal features, each patch is linearly embedded into a high-dimensional vector using a 3D linear projection.
- The 3D token embedding (tubelet embedding) includes spatial information within each frame and temporal information across frames.
- Positional embeddings are added to the 3D token embeddings to encode spatial and temporal information.
- These embeddings convey both the spatial location within a frame and the temporal order across frames.
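A minimal sketch of these two embedding steps in TensorFlow/Keras (the framework listed in the requirements) is shown below. The layer names and the `embed_dim`/`patch_size` arguments are illustrative assumptions, not necessarily the code in `src/models/`:

```python
import tensorflow as tf
from tensorflow.keras import layers

class TubeletEmbedding(layers.Layer):
    """Projects non-overlapping spatio-temporal patches (tubelets) into token embeddings."""

    def __init__(self, embed_dim, patch_size, **kwargs):
        super().__init__(**kwargs)
        # A 3D convolution whose stride equals its kernel size cuts the clip into
        # non-overlapping tubelets and linearly projects each one in a single step.
        self.projection = layers.Conv3D(
            filters=embed_dim, kernel_size=patch_size, strides=patch_size, padding="VALID"
        )
        # Flatten the (time, height, width) grid of tubelets into one token sequence.
        self.flatten = layers.Reshape(target_shape=(-1, embed_dim))

    def call(self, videos):
        # videos: (batch, frames, height, width, channels)
        return self.flatten(self.projection(videos))  # (batch, num_tokens, embed_dim)


class PositionalEncoder(layers.Layer):
    """Adds a learned positional embedding that encodes spatial and temporal order."""

    def __init__(self, embed_dim, **kwargs):
        super().__init__(**kwargs)
        self.embed_dim = embed_dim

    def build(self, input_shape):
        num_tokens = input_shape[1]
        self.position_embedding = layers.Embedding(input_dim=num_tokens, output_dim=self.embed_dim)
        self.positions = tf.range(start=0, limit=num_tokens, delta=1)

    def call(self, tokens):
        # Each token position gets its own learned offset, added to the token embedding.
        return tokens + self.position_embedding(self.positions)
```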
- The core of the ViT architecture consists of multiple layers of transformer encoder blocks.
- Each encoder block typically includes (see the sketch after this list):
  - Multi-Head Self-Attention Mechanism (MSA): enables tokens to attend to different parts of the input sequence, capturing global and local dependencies.
  - Feedforward Neural Network (FFN): applies a non-linear transformation to the attended features.
  - Layer Normalization and Residual Connections: enhance the stability and training of the model.
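As a sketch (again assuming TensorFlow/Keras; the function name and hyperparameters are illustrative, not the repository's exact implementation), one pre-norm encoder block could look like this:

```python
import tensorflow as tf
from tensorflow.keras import layers

def transformer_encoder_block(tokens, num_heads, embed_dim, mlp_dim, dropout=0.1):
    """One pre-norm encoder block: self-attention and a feedforward network, each with a residual."""
    # Multi-head self-attention over the token sequence (LayerNorm first, residual after).
    x = layers.LayerNormalization(epsilon=1e-6)(tokens)
    attention = layers.MultiHeadAttention(
        num_heads=num_heads, key_dim=embed_dim // num_heads, dropout=dropout
    )(x, x)
    x = layers.Add()([attention, tokens])

    # Position-wise feedforward network with a GELU non-linearity, again with a residual.
    y = layers.LayerNormalization(epsilon=1e-6)(x)
    y = layers.Dense(mlp_dim, activation=tf.nn.gelu)(y)
    y = layers.Dense(embed_dim)(y)
    return layers.Add()([y, x])

# Stacking the block several times forms the encoder, e.g.:
# for _ in range(8):
#     tokens = transformer_encoder_block(tokens, num_heads=8, embed_dim=128, mlp_dim=256)
```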
- After passing through the transformer encoder blocks, the output token embeddings are used for the final classification.
- A special token (CLS token) is added at the beginning of the sequence, and its final embedding is used as a summary representation for the entire input video.
- The CLS token's embedding is then fed into a classification head, which produces the final prediction from both spatial and temporal features.
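A hedged sketch of the CLS-token approach described above (TensorFlow/Keras assumed; the layer name and the binary violence/non-violence head are illustrative):

```python
import tensorflow as tf
from tensorflow.keras import layers

class AddClassToken(layers.Layer):
    """Prepends a learnable [CLS] token to the token sequence."""

    def build(self, input_shape):
        embed_dim = input_shape[-1]
        self.cls_token = self.add_weight(
            name="cls_token", shape=(1, 1, embed_dim), initializer="zeros", trainable=True
        )

    def call(self, tokens):
        batch_size = tf.shape(tokens)[0]
        cls = tf.tile(self.cls_token, [batch_size, 1, 1])  # one CLS token per clip in the batch
        return tf.concat([cls, tokens], axis=1)


def classification_head(encoded_tokens, num_classes=2):
    """Maps the final CLS embedding to class probabilities (violence / non-violence)."""
    cls_embedding = encoded_tokens[:, 0]                     # the CLS position summarises the clip
    cls_embedding = layers.LayerNormalization(epsilon=1e-6)(cls_embedding)
    return layers.Dense(num_classes, activation="softmax")(cls_embedding)
```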
- Data Augmentation
- Hyperparameter Tuning
- Learning Rate Scheduler (see the sketch after this list)
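As one example of the last item, a learning-rate schedule can be added with built-in Keras utilities; the rates, step counts, and patience values below are illustrative placeholders, not tuned settings:

```python
import tensorflow as tf

# Cosine decay schedule attached directly to the optimizer.
lr_schedule = tf.keras.optimizers.schedules.CosineDecay(
    initial_learning_rate=1e-4, decay_steps=10_000
)
optimizer = tf.keras.optimizers.Adam(learning_rate=lr_schedule)

# Alternatively, a callback reduces the rate whenever the validation loss plateaus.
reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(
    monitor="val_loss", factor=0.5, patience=3, min_lr=1e-6
)
# model.fit(train_ds, validation_data=val_ds, callbacks=[reduce_lr])
```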
- ViViT: A Video Vision Transformer
- An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
- How to Train Vision Transformer on Small-scale Datasets?
- Training a Vision Transformer from scratch in less than 24 hours with 1 GPU
This project is licensed under the .
Abdulrahman Adel Ibrahim
Email: abdulrahman.adel098@gmail.com
Feel free to reach out with any questions or suggestions!