
Proxemics-net++: classification of human interactions in still images

📚 Support Code for Proxemics-Net++ Method

Developed by Isabel Jiménez-Velasco, Jorge Zafra-Palma, Rafael Muñoz-Salinas and Manuel J. Marín-Jiménez.

DOI   |   Open in Colab

 

Examples of human-human interactions.

Figure 1: Examples of human-human interactions. These images illustrate the great complexity inherent in the problem of recognizing human interactions in images. The images in (a) highlight situations where it is confusing to determine the type of physical contact (hand-elbow, hand-shoulder, elbow-shoulder, etc.) due to clothing and partial occlusion. In (b), the images show ambiguity in determining the type of social relationship between individuals (family, friends, co-workers, etc.) without additional context.

 

Human interaction recognition (HIR) is a significant challenge in computer vision that focuses on identifying human interactions in images and videos. HIR is highly complex due to factors such as pose diversity, varying scene conditions, and the presence of multiple individuals. Recent research has explored different approaches to address it, with an increasing emphasis on human pose estimation. In this work, we propose Proxemics-Net++, an extension of the Proxemics-Net model, capable of addressing the problem of recognizing human interactions in images through two different tasks: identifying the type of “touch code” (proxemics) and identifying the type of social relationship between pairs of people. To achieve this, we use RGB and body pose information together with ConvNeXt, a state-of-the-art deep learning architecture, as the backbone. We performed an ablative analysis to understand how the combination of RGB and body pose information affects these two tasks. Experimental results show that body pose information contributes significantly to proxemics recognition (first task), allowing us to improve on the existing state of the art, while its contribution to the classification of social relationships (second task) is limited by the labelling ambiguity inherent in that problem, so RGB information is more influential there.

 

Our Proxemics-Net++ model

Figure 2: Our Proxemics-Net++ model. It consists of six inputs: three branches for the RGB information of the couple and the individuals that compose it (blue branches) and another three branches for the body pose representation of the two individuals and the couple (green branches). All branches have the same type of backbone (Base or Large). The outputs of these six branches are passed to a Fusion Block, which can be of two types: Concatenation fusion or CrossAttention fusion. Finally, the type of human interaction (proxemics or social relationship) of the input samples is predicted.
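To make the data flow concrete, here is a minimal PyTorch sketch of a six-branch model with a concatenation-style fusion block and a classification head. It is an illustrative approximation rather than the authors' implementation: the class name, pooling, and classifier sizes are assumptions made for this example.

    # Minimal sketch: six ConvNeXt branches + concatenation fusion (illustrative only).
    import torch
    import torch.nn as nn
    from torchvision.models import convnext_base

    class ProxemicsNetSketch(nn.Module):
        def __init__(self, num_classes=6, feat_dim=1024):
            super().__init__()
            # One backbone per branch: pair / person 0 / person 1, for RGB and for pose maps.
            self.branches = nn.ModuleList(
                [nn.Sequential(convnext_base(weights=None).features,
                               nn.AdaptiveAvgPool2d(1),
                               nn.Flatten())
                 for _ in range(6)]
            )
            # Concatenation fusion: stack the six descriptors, then classify.
            self.classifier = nn.Sequential(
                nn.Linear(6 * feat_dim, 512),
                nn.GELU(),
                nn.Linear(512, num_classes),
            )

        def forward(self, rgb_pair, rgb_p0, rgb_p1, pose_pair, pose_p0, pose_p1):
            inputs = [rgb_pair, rgb_p0, rgb_p1, pose_pair, pose_p0, pose_p1]
            feats = [branch(x) for branch, x in zip(self.branches, inputs)]
            return self.classifier(torch.cat(feats, dim=1))

    # Example forward pass with one 224x224 crop per branch.
    model = ProxemicsNetSketch()
    dummy = [torch.randn(1, 3, 224, 224) for _ in range(6)]
    logits = model(*dummy)  # shape: (1, 6)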

 

📈 Comparison to the State of the Art - Proxemics dataset

| Model | HH | HS | SS | HT | HE | ES | mAP (a) | mAP (b) |
|-------|----|----|----|----|----|----|---------|---------|
| Yang et al. | 37 | 29 | 50 | 61 | 38 | 34 | 42 | 38 |
| Chu et al. | 41.2 | 35.4 | 62.2 | - | 43.9 | 55 | - | 46.6 |
| Jiang et al. | 59.7 | 52 | 53.9 | 33.2 | 36.1 | 36.2 | 45.2 | 47.5 |
| Li W. et al. | 56.7 | 55.1 | 52.8 | 78.4 | 65.0 | 65.5 | 62.3 | 59.1 |
| Sousa et al. | 66.2 | 55.1 | 69.5 | 78.8 | 65.6 | 68.1 | 67.2 | 64.9 |
| Jiménez et al. | 62.4 | 56.7 | 62.4 | 86.4 | 68.8 | 67.9 | 67.4 | 63.8 |
| Our ConvNeXt_Base (CrossAttention, RGB+Pose, Full Model) | 71.5 | 63.2 | 80.5 | 80.7 | 75.6 | 71.3 | 73.8 | 72.4 |

Table 1: Comparison of our best model on the Proxemics dataset with the state of the art. All values are in %.

 

This table compares two mAP values: mAP (a) is the mean of the AP values over the six proxemics classes, and mAP (b) is the mean of the AP values excluding the Hand-Torso (HT) class, as done in Chu et al.
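As a quick sanity check (not part of the project code), both mAP values for our model's row can be reproduced directly from its per-class APs in Table 1:

    # Reproducing mAP (a) and mAP (b) for our row in Table 1.
    ap = {"HH": 71.5, "HS": 63.2, "SS": 80.5, "HT": 80.7, "HE": 75.6, "ES": 71.3}

    map_a = sum(ap.values()) / len(ap)                      # mean over all six classes
    map_b = sum(v for k, v in ap.items() if k != "HT") / 5  # excluding Hand-Torso (HT)

    print(round(map_a, 1), round(map_b, 1))  # 73.8 72.4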

The comparison shows that our best model (RGB+Pose with individual and pair branches, Base variant, and CrossAttention Fusion Block) achieves the highest AP in almost all proxemics types, outperforming existing methods by 6.4 percentage points in mAP (a) and 7.5 in mAP (b). These results show that combining RGB and pose data with a deep learning model such as ConvNeXt significantly improves proxemics recognition.

 

📈 Comparison to the State of the Art - PISC dataset

| Model | Friends | Family | Couple | Prof. | Comm. | No Rel. | mAP |
|-------|---------|--------|--------|-------|-------|---------|-----|
| Li J. et al. | 60.6 | 64.9 | 54.7 | 82.2 | 58 | 70.6 | 65.2 |
| Zhang et al. | 64.6 | 67.8 | 60.5 | 76.8 | 34.7 | 70.4 | 70.0 |
| Goel et al. | - | - | - | - | - | - | 71.6 |
| Li W. et al. | 60.8 | 65.9 | 84.8 | 73.0 | 51.7 | 70.4 | 72.7 |
| Li L. et al. | 82.2 | 39.4 | 33.2 | 60.0 | 47.7 | 71.8 | 73.3 |
| Yang et al. | 63.1 | 73.5 | 78.3 | 82.7 | 76.8 | 71.8 | 73.6 |
| Sousa et al. | 49.4 | 70.5 | 74.6 | 76.5 | 59.6 | 74.6 | 75.2 |
| Our ConvNeXt_Base (Concat, RGB, Full Model) | 56.2 | 83.9 | 77.6 | 61.0 | 59.0 | 82.9 | 70.1 |

Table 2: Comparison of our best model on the PISC dataset with the state of the art. All values are in %. (Prof. = Professional, Comm. = Commercial, No Rel. = No Relation.)

 

Table 2 compares our best model with existing state-of-the-art approaches for social interaction recognition. Although our model (RGB model with individual and pairs branches, Base variant, and Concatenation Fusion Block) performs best in the "Family" and "No Relation" categories, it doesn't surpass the current best overall (70.1% mAP vs. 75.2% mAP).

Notably, most other methods rely on graph-based architectures, unlike ours, which uses a conventional (non-graph) deep neural network. Compared with the PISC authors' own deep-network approach (Li J. et al.), we achieve better results (70.1% mAP vs. 65.2% mAP), showing that RGB data combined with the ConvNeXt architecture improves social relationship recognition. However, the recent trend towards graph-based architectures suggests that, given its complexity, this problem calls for models that explicitly focus on relationships.

 


 

🚀 What's new?

  • base_model_main/: Main directory for the base model.
  • dataset/: Directory containing the code necessary for dataset preprocessing - RGB and Pose.
  • imgs/: Directory containing some of the images used in this work.
  • rgb_pose_model_main/: Main directory for the full model (RGB / Pose / RGB+Pose models).
  • test/: Directory containing code and resources related to model testing.
  • train/: Directory containing code and resources related to model training.
  • requirements_proxemicsNet.txt: File specifying the necessary dependencies for the project.
  • requirements_retinaFace_detectron_densepose.txt: File specifying the dependencies needed to obtain the pose maps of the images (using RetinaFace and DensePose).

 

🌟 Quick Start

â–Ș Installing Dependencies

To install the dependencies needed to run this project, use the following command:

conda create --name <env> --file requirements_proxemicsNet.txt

To install the dependencies needed to obtain the pose maps from the images (only required if you are not using the preprocessed datasets provided, or if you are using new images):

conda create --name <env> --file requirements_retinaFace_detectron_densepose.txt
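Whichever environment you create, remember to activate it before running the commands in the next sections (replace <env> with the name you chose):

    conda activate <env>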

â–Ș Downloading and unzipping the Preprocessed Datasets ZIP

To use the pre-processed datasets, you must first download and unzip the files.

  • Proxemics and PISC datasets available at DOI

  • You can also use the following command:

    # Proxemics dataset (112MB)
    wget -O dataset_proxemics.zip "https://zenodo.org/records/11184513/files/dataset_proxemics.zip?download=1"
    unzip dataset_proxemics.zip
    
    # PISC dataset (3.76GB)
    wget -O dataset_pisc.zip "https://zenodo.org/records/11184513/files/dataset_pisc.zip?download=1"
    unzip dataset_pisc.zip 
    

â–Ș Downloading the pre-trained ConvNeXt models

To use the pre-trained ConvNeXt models that we have selected as a backbone to train our Proxemics-Net++ models, you need to download them from the following locations:

Once downloaded, you need to unzip them and place them one level above, i.e., in ../premodels/.
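For example, assuming the backbone checkpoints come as a single ZIP archive (the archive name below is only a placeholder), placing them could look like this:

    mkdir -p ../premodels
    unzip <convnext_pretrained_models>.zip -d ../premodels/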

 

🌟 Training a New Model - RGB / Pose / RGB+Pose

To train and test a new model (Proxemics or PISC), go to the rgb_pose_model_main directory and run one of the following commands, depending on the type of model you want to train:

  • For RGB models

    • Full Model (3 Branches)

       python3 rgb_pose_model_main_convNext.py --datasetDIR <DIR dataset/> --outModelsDIR <DIR where you'll save the model> --modeltype <base/large> --b <batchsize> --set <set1/set2> --lr <learningRate> --datasetName <proxemics/pisc> --rgb
      
    • Only Pair RGB

       python3 rgb_pose_model_main_convNext.py --datasetDIR <DIR dataset/> --outModelsDIR <DIR where you'll save the model> --modeltype <base/large> --b <batchsize> --set <set1/set2> --lr <learningRate> --datasetName <proxemics/pisc> --rgb --onlyPairRGB
      
  • For Pose models

    • Full Model (3 Branches)

       python3 rgb_pose_model_main_convNext.py --datasetDIR <DIR dataset/> --outModelsDIR <DIR where you'll save the model> --modeltype <base/large> --b <batchsize> --set <set1/set2> --lr <learningRate> --datasetName <proxemics/pisc>  --pose
      
    • Only Pair Pose

       python3 rgb_pose_model_main_convNext.py --datasetDIR <DIR dataset/> --outModelsDIR <DIR where you'll save the model> --modeltype <base/large> --b <batchsize> --set <set1/set2> --lr <learningRate> --datasetName <proxemics/pisc>  --pose --onlyPairPose
      
  • For RGB+Pose models

    • Full Model (6 Branches)

       python3 rgb_pose_model_main_convNext.py --datasetDIR <DIR dataset/> --outModelsDIR <DIR where you'll save the model> --modeltype <base/large> --b <batchsize> --set <set1/set2> --lr <learningRate> --datasetName <proxemics/pisc>  --rgb --pose  
      
    • RGB+Pose (Only Pair RGB)

       python3 rgb_pose_model_main_convNext.py --datasetDIR <DIR dataset/> --outModelsDIR <DIR where you'll save the model> --modeltype <base/large> --b <batchsize> --set <set1/set2> --lr <learningRate> --datasetName <proxemics/pisc>  --rgb --pose --onlyPairRGB
      
    • RGB+Pose (Only Pair Pose)

       python3 rgb_pose_model_main_convNext.py --datasetDIR <DIR dataset/> --outModelsDIR <DIR where you'll save the model> --modeltype <base/large> --b <batchsize> --set <set1/set2> --lr <learningRate> --datasetName <proxemics/pisc>  --rgb --pose --onlyPairPose
      
    • RGB+Pose (Only Pair RGB + Only Pair Pose)

       python3 rgb_pose_model_main_convNext.py --datasetDIR <DIR dataset/> --outModelsDIR <DIR where you'll save the model> --modeltype <base/large> --b <batchsize> --set <set1/set2> --lr <learningRate> --datasetName <proxemics/pisc>  --rgb --pose --onlyPairRGB --onlyPairPose
      

Be sure to replace the values between <...> with the specific paths and configurations required for your project.

Note: The model is trained with Concatenation fusion by default. To use CrossAttention fusion, add --crossAttention to your command line.
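As an illustration, a hypothetical invocation for the full RGB+Pose model with CrossAttention fusion on the Proxemics dataset could look as follows (the paths, batch size, and learning rate are placeholder values to adapt):

    python3 rgb_pose_model_main_convNext.py --datasetDIR ./dataset/ --outModelsDIR ./models/ --modeltype base --b 8 --set set1 --lr 0.001 --datasetName proxemics --rgb --pose --crossAttention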

 

🌟 Inference - Download the best Proxemics-Net++ model we have trained for each dataset.

Here are 2 of the best Proxemics-Net++ models we have trained.

  • Proxemics dataset

    • A model with ConvNeXt Base as the backbone. This model has given the best results (see Table 1 - SOTA). It has been trained with RGB and Pose information of individuals and pairs (Full model - 6 branches). In addition, it uses the CrossAttention fusion. Download here (4.01GB)
  • PISC dataset

    • A model with ConvNeXt Base as the backbone. This model has given the best results (see Table 2 - SOTA). It has been trained with RGB information of individuals and pairs (Full model - 3 branches). In addition, it uses Concatenation fusion. Download here (1.48GB)

You can test these models in the Google Colab Demo we have prepared for you. Open in Colab

 

📝 Citing Proxemics-Net++

If you find Proxemics-Net++ useful in your work, please consider citing the following BibTeX entry:

@article{Jimenez2024,
  author    = {Jiménez, I. and Zafra, J. and Muñoz, R. and Marín, M. J.},
  title     = {Proxemics-net++: classification of human interactions in still images},
  journal   = {Pattern Analysis and Applications},
  volume    = {27},
  number    = {1},
  pages     = {49},
  year      = {2024},
  doi       = {10.1007/s10044-024-01270-3}
}