ArXiv cs.CV --Mon, 3 May 2021

1.Differentiable Event Stream Simulator for Non-Rigid 3D Tracking ⬇️

This paper introduces the first differentiable simulator of event streams, i.e., streams of asynchronous brightness change signals recorded by event cameras. Our differentiable simulator enables non-rigid 3D tracking of deformable objects (such as human hands, isometric surfaces and general watertight meshes) from event streams by leveraging an analysis-by-synthesis principle. So far, event-based tracking and reconstruction of non-rigid objects in 3D, like hands and body, has been either tackled using explicit event trajectories or large-scale datasets. In contrast, our method does not require any such processing or data, and can be readily applied to incoming event streams. We show the effectiveness of our approach for various types of non-rigid objects and compare to existing methods for non-rigid 3D tracking. In our experiments, the proposed energy-based formulations outperform competing RGB-based methods in terms of 3D errors. The source code and the new data are publicly available.

2.Deep Multi-View Stereo gone wild ⬇️

Deep multi-view stereo (deep MVS) methods have been developed and extensively compared on simple datasets, where they now outperform classical approaches. In this paper, we ask whether the conclusions reached in controlled scenarios are still valid when working with Internet photo collections. We propose a methodology for evaluation and explore the influence of three aspects of deep MVS methods: network architecture, training data, and supervision. We make several key observations, which we extensively validate quantitatively and qualitatively, both for depth prediction and complete 3D reconstructions. First, we outline the promises of unsupervised techniques by introducing a simple approach which provides more complete reconstructions than supervised options when using a simple network architecture. Second, we emphasize that not all multiscale architectures generalize to the unconstrained scenario, especially without supervision. Finally, we show the efficiency of noisy supervision from large-scale 3D reconstructions which can even lead to networks that outperform classical methods in scenarios where very few images are available.

3.Semantic Relation Preserving Knowledge Distillation for Image-to-Image Translation ⬇️

Generative adversarial networks (GANs) have shown significant potential in modeling high dimensional distributions of image data, especially on image-to-image translation tasks. However, due to the complexity of these tasks, state-of-the-art models often contain a tremendous amount of parameters, which results in large model size and long inference time. In this work, we propose a novel method to address this problem by applying knowledge distillation together with distillation of a semantic relation preserving matrix. This matrix, derived from the teacher's feature encoding, helps the student model learn better semantic relations. In contrast to existing compression methods designed for classification tasks, our proposed method adapts well to the image-to-image translation task on GANs. Experiments conducted on 5 different datasets and 3 different pairs of teacher and student models provide strong evidence that our methods achieve impressive results both qualitatively and quantitatively.
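
As a rough illustration of the distillation idea described above (not code from the paper), the sketch below builds a pairwise feature-similarity matrix and matches it between teacher and student. The function names and the cosine-similarity construction are assumptions; the paper's semantic relation preserving matrix may be defined differently.

```python
import torch
import torch.nn.functional as F

def relation_matrix(features: torch.Tensor) -> torch.Tensor:
    """Pairwise cosine-similarity matrix over spatial positions of a feature map.

    One common way to build a "semantic relation" matrix; the paper derives
    its matrix from the teacher's feature encoding and may differ in detail.
    """
    b, c, h, w = features.shape
    f = features.flatten(2).transpose(1, 2)   # (B, H*W, C)
    f = F.normalize(f, dim=-1)
    return f @ f.transpose(1, 2)              # (B, H*W, H*W)

def relation_distill_loss(student_feat: torch.Tensor, teacher_feat: torch.Tensor) -> torch.Tensor:
    # Assumes both feature maps share the same spatial size; the teacher is frozen.
    return F.mse_loss(relation_matrix(student_feat), relation_matrix(teacher_feat).detach())
```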

4.A Good Image Generator Is What You Need for High-Resolution Video Synthesis ⬇️

Image and video synthesis are closely related areas aiming at generating content from noise. While rapid progress has been demonstrated in improving image-based models to handle large resolutions, high-quality renderings, and wide variations in image content, achieving comparable video generation results remains problematic. We present a framework that leverages contemporary image generators to render high-resolution videos. We frame the video synthesis problem as discovering a trajectory in the latent space of a pre-trained and fixed image generator. Not only does such a framework render high-resolution videos, but it also is an order of magnitude more computationally efficient. We introduce a motion generator that discovers the desired trajectory, in which content and motion are disentangled. With such a representation, our framework allows for a broad range of applications, including content and motion manipulation. Furthermore, we introduce a new task, which we call cross-domain video synthesis, in which the image and motion generators are trained on disjoint datasets belonging to different domains. This allows for generating moving objects for which the desired video data is not available. Extensive experiments on various datasets demonstrate the advantages of our methods over existing video generation techniques. Code will be released at this https URL.

5.Black-box adversarial attacks using Evolution Strategies ⬇️

In the last decade, deep neural networks have proven to be very powerful in computer vision tasks, starting a revolution in the computer vision and machine learning fields. However, deep neural networks are usually not robust to perturbations of the input data. In fact, several studies have shown that slightly changing the content of an image can cause a dramatic decrease in the accuracy of the attacked neural network. Several methods able to generate adversarial samples make use of gradients, which are usually not available to an attacker in real-world scenarios. In contrast to this class of attacks, another class of adversarial attacks, called black-box adversarial attacks, has emerged; these do not use gradient information and are therefore more suitable for real-world attack scenarios. In this work, we compare three well-known evolution strategies on the generation of black-box adversarial attacks for image classification tasks. While our results show that the attacked neural networks can, in most cases, be easily fooled by all the algorithms under comparison, they also show that some black-box optimization algorithms may be better in "harder" setups, both in terms of attack success rate and efficiency (i.e., number of queries).
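
For readers unfamiliar with the query-only setting, the following minimal (1+1)-ES-style loop sketches how such an attack proceeds without gradients. `model_query` is a hypothetical interface returning class probabilities, and the three specific evolution strategies compared in the paper are not reproduced here.

```python
import numpy as np

def es_attack(model_query, x, true_label, eps=8 / 255, sigma=0.05, steps=500):
    """Query-based black-box attack sketch: mutate, query, keep the better offspring."""
    best = x.copy()
    probs = model_query(best)                     # class probabilities; no gradients used
    best_conf = probs[true_label]
    for _ in range(steps):
        if np.argmax(probs) != true_label:        # already misclassified: attack succeeded
            break
        candidate = np.clip(best + sigma * np.random.randn(*x.shape), x - eps, x + eps)
        candidate = np.clip(candidate, 0.0, 1.0)  # stay a valid image inside the L-inf ball
        cand_probs = model_query(candidate)
        if cand_probs[true_label] < best_conf:    # selection: lower true-class confidence wins
            best, best_conf, probs = candidate, cand_probs[true_label], cand_probs
    return best
```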

6.DriveGAN: Towards a Controllable High-Quality Neural Simulation ⬇️

Realistic simulators are critical for training and verifying robotics systems. While most contemporary simulators are hand-crafted, a scalable way to build simulators is to use machine learning to learn how the environment behaves in response to an action, directly from data. In this work, we aim to learn to simulate a dynamic environment directly in pixel-space, by watching unannotated sequences of frames and their associated actions. We introduce a novel high-quality neural simulator referred to as DriveGAN that achieves controllability by disentangling different components without supervision. In addition to steering controls, it also includes controls for sampling features of a scene, such as the weather as well as the location of non-player objects. Since DriveGAN is a fully differentiable simulator, it further allows for re-simulation of a given video sequence, letting an agent drive through a recorded scene again, possibly taking different actions. We train DriveGAN on multiple datasets, including 160 hours of real-world driving data. We showcase that our approach greatly surpasses the performance of previous data-driven simulators, and allows for new features not explored before.

7.Updatable Siamese Tracker with Two-stage One-shot Learning ⬇️

Offline Siamese networks have achieved very promising tracking performance, especially in accuracy and efficiency. However, they often fail to track an object in complex scenes because they cannot update online. Traditional updaters struggle to handle the irregular variations and sampling noise of objects, so it is quite risky to adopt them to update Siamese networks. In this paper, we first present a two-stage one-shot learner, which can predict the local parameters of the primary classifier from object samples at diverse stages. Then, an updatable Siamese network based on this learner (SiamTOL) is proposed, which is able to perform online update by itself. Concretely, we introduce an extra input branch to sequentially capture the latest object features, and design a residual module to update the initial exemplar using these features. Besides, an effective multi-aspect training loss is designed for our network to avoid overfitting. Extensive experimental results on several popular benchmarks including OTB100, VOT2018, VOT2019, LaSOT, UAV123 and GOT10k show that the proposed tracker achieves leading performance and outperforms other state-of-the-art methods.

8.Post-training deep neural network pruning via layer-wise calibration ⬇️

We present a post-training weight pruning method for deep neural networks that achieves accuracy levels tolerable for the production setting and that is sufficiently fast to be run on commodity hardware such as desktop CPUs or edge devices. We propose a data-free extension of the approach for computer vision models based on automatically-generated synthetic fractal images. We obtain state-of-the-art results for data-free neural network pruning, with ~1.5% top@1 accuracy drop for a ResNet50 on ImageNet at 50% sparsity rate. When using real data, we are able to get a ResNet50 model on ImageNet with 65% sparsity rate in 8-bit precision in a post-training setting with a ~1% top@1 accuracy drop. We release the code as a part of the OpenVINO(TM) Post-Training Optimization tool.
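
As context for the numbers above, plain layer-wise magnitude pruning at a fixed sparsity can be sketched in a few lines. The paper's layer-wise calibration and data-free fractal-image procedure are not shown; the function below is only an assumed baseline.

```python
import torch

def magnitude_prune_(model: torch.nn.Module, sparsity: float = 0.5) -> None:
    """Zero out the smallest-magnitude weights of every Conv/Linear layer, in place."""
    for module in model.modules():
        if isinstance(module, (torch.nn.Conv2d, torch.nn.Linear)):
            w = module.weight.data
            k = int(w.numel() * sparsity)                 # number of weights to prune in this layer
            if k == 0:
                continue
            threshold = w.abs().flatten().kthvalue(k).values
            w.mul_((w.abs() > threshold).to(w.dtype))     # keep only the largest-magnitude weights
```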

9.Deep Image Destruction: A Comprehensive Study on Vulnerability of Deep Image-to-Image Models against Adversarial Attacks ⬇️

Recently, the vulnerability of deep image classification models to adversarial attacks has been investigated. However, such an issue has not been thoroughly studied for image-to-image models that can have different characteristics in quantitative evaluation, consequences of attacks, and defense strategy. To tackle this, we present comprehensive investigations into the vulnerability of deep image-to-image models to adversarial attacks. For five popular image-to-image tasks, 16 deep models are analyzed from various standpoints such as output quality degradation due to attacks, transferability of adversarial examples across different tasks, and characteristics of perturbations. We show that unlike in image classification tasks, the performance degradation on image-to-image tasks can largely differ depending on various factors, e.g., attack methods and task objectives. In addition, we analyze the effectiveness of conventional defense methods used for classification models in improving the robustness of the image-to-image models.

10.RR-Net: Injecting Interactive Semantics in Human-Object Interaction Detection ⬇️

Human-Object Interaction (HOI) detection aims to learn how humans interact with surrounding objects. The latest end-to-end HOI detectors lack relation reasoning, which prevents them from learning HOI-specific interactive semantics for predictions. In this paper, we therefore propose novel relation reasoning for HOI detection. We first present a progressive Relation-aware Frame, which brings a new structure and parameter-sharing pattern for interaction inference. Upon this frame, an Interaction Intensifier Module and a Correlation Parsing Module are carefully designed, where: a) interactive semantics from humans can be exploited and passed to objects to intensify interactions, and b) interactive correlations among humans, objects and interactions are integrated to promote predictions. Based on the modules above, we construct an end-to-end trainable framework named Relation Reasoning Network (abbr. RR-Net). Extensive experiments show that our proposed RR-Net sets a new state-of-the-art on both V-COCO and HICO-DET benchmarks and improves over the baseline by about 5.5% and 9.8% relative, validating that this first effort in exploring relation reasoning and integrating interactive semantics brings clear improvements for end-to-end HOI detection.

11.Interpretable Semantic Photo Geolocalization ⬇️

Planet-scale photo geolocalization is the complex task of estimating the location depicted in an image solely based on its visual content. Due to the success of convolutional neural networks (CNNs), current approaches achieve super-human performance. However, previous work has exclusively focused on optimizing geolocalization accuracy. Moreover, due to the black-box property of deep learning systems, their predictions are difficult to validate for humans. State-of-the-art methods treat the task as a classification problem, where the choice of the classes, that is the partitioning of the world map, is the key for success. In this paper, we present two contributions in order to improve the interpretability of a geolocalization model: (1) We propose a novel, semantic partitioning method which intuitively leads to an improved understanding of the predictions, while at the same time state-of-the-art results are achieved for geolocational accuracy on benchmark test sets; (2) We introduce a novel metric to assess the importance of semantic visual concepts for a certain prediction to provide additional interpretable information, which allows for a large-scale analysis of already trained models.

12.CAT: Cross-Attention Transformer for One-Shot Object Detection ⬇️

Given a query patch from a novel class, one-shot object detection aims to detect all instances of that class in a target image through semantic similarity comparison. However, due to the extremely limited guidance in the novel class as well as the unseen appearance difference between query and target instances, it is difficult to appropriately exploit their semantic similarity and generalize well. To mitigate this problem, we present a universal Cross-Attention Transformer (CAT) module for accurate and efficient semantic similarity comparison in one-shot object detection. The proposed CAT utilizes the transformer mechanism to comprehensively capture bi-directional correspondence between any paired pixels from the query and the target image, which empowers us to sufficiently exploit their semantic characteristics for accurate similarity comparison. In addition, the proposed CAT enables feature dimensionality compression for inference speedup without performance loss. Extensive experiments on COCO, VOC, and FSOD under one-shot settings demonstrate the effectiveness and efficiency of our method, e.g., it surpasses CoAE, a major baseline in this task, by 1.0% in AP on COCO and runs nearly 2.5 times faster. Code will be available in the future.

13.Using brain inspired principles to unsupervisedly learn good representations for visual pattern recognition ⬇️

Although deep learning has solved difficult problems in visual pattern recognition, it is mostly successful in tasks where there is plenty of labeled training data available. Furthermore, the global back-propagation-based training rule and the number of layers employed represent a departure from biological inspiration. The brain is able to perform most of these tasks in a very general way from limited to no labeled data. For these reasons it remains a key research question to look into computational principles in the brain that can help guide models to unsupervisedly learn good representations, which can then be used to perform tasks like classification. In this work we explore some of these principles to generate such representations for the MNIST data set. We compare the obtained results with similar recent works and verify extremely competitive results.

14.Unsupervised data augmentation for object detection ⬇️

Data augmentation has always been an effective way to overcome overfitting when the dataset is small. There are already many augmentation operations, such as horizontal flip, random crop or even Mixup. However, unlike in the image classification task, we cannot simply perform these operations for object detection because the generated images would lack labeled bounding box information. To address this challenge, we propose a framework that uses Generative Adversarial Networks (GANs) to perform unsupervised data augmentation. Specifically, building on the recent strong performance of YOLOv4, we propose a two-step pipeline that enables us to generate an image in which the object lies in a certain position. In this way, we can generate images together with their bounding box labels.

15.Deep learning with self-supervision and uncertainty regularization to count fish in underwater images ⬇️

Effective conservation actions require effective population monitoring. However, accurately counting animals in the wild to inform conservation decision-making is difficult. Monitoring populations through image sampling has made data collection cheaper, wide-reaching and less intrusive but created a need to process and analyse this data efficiently. Counting animals from such data is challenging, particularly when densely packed in noisy images. Attempting this manually is slow and expensive, while traditional computer vision methods are limited in their generalisability. Deep learning is the state-of-the-art method for many computer vision tasks, but it has yet to be properly explored to count animals. To this end, we employ deep learning, with a density-based regression approach, to count fish in low-resolution sonar images. We introduce a large dataset of sonar videos, deployed to record wild mullet schools (Mugil liza), with a subset of 500 labelled images. We utilise abundant unlabelled data in a self-supervised task to improve the supervised counting task. For the first time in this context, by introducing uncertainty quantification, we improve model training and provide an accompanying measure of prediction uncertainty for more informed biological decision-making. Finally, we demonstrate the generalisability of our proposed counting framework through testing it on a recent benchmark dataset of high-resolution annotated underwater images from varying habitats (DeepFish). From experiments on both contrasting datasets, we demonstrate our network outperforms the few other deep learning models implemented for solving this task. By providing an open-source framework along with training data, our study puts forth an efficient deep learning template for crowd counting aquatic animals thereby contributing effective methods to assess natural populations from the ever-increasing visual data.

16.Determining Chess Game State From an Image ⬇️

Identifying the configuration of chess pieces from an image of a chessboard is a problem in computer vision that has not yet been solved accurately. However, it is important for helping amateur chess players improve their games by facilitating automatic computer analysis without the overhead of manually entering the pieces. Current approaches are limited by the lack of large datasets and are not designed to adapt to unseen chess sets. This paper puts forth a new dataset synthesised from a 3D model that is an order of magnitude larger than existing ones. Trained on this dataset, a novel end-to-end chess recognition system is presented that combines traditional computer vision techniques with deep learning. It localises the chessboard using a RANSAC-based algorithm that computes a projective transformation of the board onto a regular grid. Using two convolutional neural networks, it then predicts an occupancy mask for the squares in the warped image and finally classifies the pieces. The described system achieves an error rate of 0.23% per square on the test set, 28 times better than the current state of the art. Further, a few-shot transfer learning approach is developed that is able to adapt the inference system to a previously unseen chess set using just two photos of the starting position, obtaining a per-square accuracy of 99.83% on images of that new chess set. The dataset is released publicly; code and trained models are available at this https URL.
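
The board-localisation step maps detected board points onto a regular grid with a RANSAC-estimated homography; a minimal OpenCV sketch of that idea follows. The point coordinates and the image filename are hypothetical, and in practice many candidate line intersections (not just four corners) would be supplied.

```python
import cv2
import numpy as np

# Hypothetical detected board points in the photo and their targets on an 800x800 grid.
image_pts = np.array([[412, 103], [1480, 121], [1618, 980], [306, 955]], dtype=np.float32)
grid_pts = np.array([[0, 0], [800, 0], [800, 800], [0, 800]], dtype=np.float32)

# Projective transformation estimated with RANSAC, then used to rectify the board.
H, inlier_mask = cv2.findHomography(image_pts, grid_pts, cv2.RANSAC, 3.0)
board = cv2.warpPerspective(cv2.imread("board.jpg"), H, (800, 800))
# `board` is the warped image on which per-square occupancy and piece classifiers run.
```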

17.SRDiff: Single Image Super-Resolution with Diffusion Probabilistic Models ⬇️

Single image super-resolution (SISR) aims to reconstruct high-resolution (HR) images from the given low-resolution (LR) ones, which is an ill-posed problem because one LR image corresponds to multiple HR images. Recently, learning-based SISR methods have greatly outperformed traditional ones, while suffering from over-smoothing, mode collapse or large model footprint issues for PSNR-oriented, GAN-driven and flow-based methods respectively. To solve these problems, we propose a novel single image super-resolution diffusion probabilistic model (SRDiff), which is the first diffusion-based model for SISR. SRDiff is optimized with a variant of the variational bound on the data likelihood and can provide diverse and realistic SR predictions by gradually transforming the Gaussian noise into a super-resolution (SR) image conditioned on an LR input through a Markov chain. In addition, we introduce residual prediction to the whole framework to speed up convergence. Our extensive experiments on facial and general benchmarks (CelebA and DIV2K datasets) show that 1) SRDiff can generate diverse SR results in rich details with state-of-the-art performance, given only one LR input; 2) SRDiff is easy to train with a small footprint; and 3) SRDiff can perform flexible image manipulation including latent space interpolation and content fusion.

18.Anomaly Detection with Prototype-Guided Discriminative Latent Embeddings ⬇️

Recent efforts towards video anomaly detection try to learn a deep autoencoder to describe normal event patterns with small reconstruction errors. The video inputs with large reconstruction errors are regarded as anomalies at test time. However, these methods sometimes reconstruct abnormal inputs well because of the powerful generalization ability of the deep autoencoder. To address this problem, we present a novel approach for anomaly detection, which utilizes discriminative prototypes of normal data to reconstruct video frames. In this way, the model will favor the reconstruction of normal events and distort the reconstruction of abnormal events. Specifically, we use a prototype-guided memory module to perform discriminative latent embedding. We introduce a new discriminative criterion for the memory module, as well as a corresponding loss function, which encourages memory items to record the representative embeddings of normal data, i.e. prototypes. Besides, we design a novel two-branch autoencoder, which is composed of a future frame prediction network and an RGB difference generation network that share the same encoder. The stacked RGB difference contains motion information just like optical flow, so our model can learn temporal regularity. We evaluate the effectiveness of our method on three benchmark datasets and experimental results demonstrate that the proposed method outperforms the state-of-the-art.

19.Evaluating Contrastive Models for Instance-based Image Retrieval ⬇️

In this work, we evaluate contrastive models for the task of image retrieval. We hypothesise that models learned to encode semantic similarity among instances via discriminative learning should perform well on image retrieval, where relevancy is defined in terms of instances of the same object. Through our extensive evaluation, we find that representations from models trained using contrastive methods perform on par with (and can outperform) a pre-trained supervised baseline trained on ImageNet labels in retrieval tasks under various configurations. This is remarkable given that the contrastive models require no explicit supervision. Thus, we conclude that these models can be used to bootstrap base models to build more robust image retrieval engines.
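
The retrieval protocol itself is straightforward once embeddings exist; below is a minimal sketch, assuming features come from a frozen, contrastively trained encoder (the function and tensor names are illustrative, not from the paper).

```python
import torch
import torch.nn.functional as F

def retrieve(query_feat: torch.Tensor, index_feats: torch.Tensor, k: int = 10) -> torch.Tensor:
    """Return indices of the k database images most similar to the query.

    query_feat: (D,) embedding of the query image; index_feats: (N, D) database embeddings.
    """
    q = F.normalize(query_feat, dim=0)
    db = F.normalize(index_feats, dim=1)
    sims = db @ q                   # (N,) cosine similarities
    return sims.topk(k).indices
```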

20.Learning Multi-Granular Hypergraphs for Video-Based Person Re-Identification ⬇️

Video-based person re-identification (re-ID) is an important research topic in computer vision. The key to tackling the challenging task is to exploit both spatial and temporal clues in video sequences. In this work, we propose a novel graph-based framework, namely Multi-Granular Hypergraph (MGH), to pursue better representational capabilities by modeling spatiotemporal dependencies in terms of multiple granularities. Specifically, hypergraphs with different spatial granularities are constructed using various levels of part-based features across the video sequence. In each hypergraph, different temporal granularities are captured by hyperedges that connect a set of graph nodes (i.e., part-based features) across different temporal ranges. Two critical issues (misalignment and occlusion) are explicitly addressed by the proposed hypergraph propagation and feature aggregation schemes. Finally, we further enhance the overall video representation by learning more diversified graph-level representations of multiple granularities based on mutual information minimization. Extensive experiments on three widely adopted benchmarks clearly demonstrate the effectiveness of the proposed framework. Notably, 90.0% top-1 accuracy on MARS is achieved using MGH, outperforming the state-of-the-arts. Code is available at this https URL.

21.Deep Learning Based Steel Pipe Weld Defect Detection ⬇️

Steel pipes are widely used in high-risk and high-pressure scenarios such as oil, chemical, natural gas and shale gas. Defects in steel pipes can lead to serious adverse consequences. Applying deep-learning-based object detection to pipe weld defect detection and identification can effectively improve inspection efficiency and promote the development of industrial automation. Most prior work applied traditional computer vision methods to detect defects in steel pipe weld seams. However, traditional computer vision methods rely on prior knowledge and can only detect defects with a single feature, so it is difficult to complete multi-defect classification, whereas deep learning is end-to-end. In this paper, the state-of-the-art single-stage object detection algorithm YOLOv5 is applied to steel pipe weld defect detection and compared with the representative two-stage object detection algorithm Faster R-CNN. The experimental results show that applying YOLOv5 to steel pipe weld defect detection can greatly improve accuracy, complete the multi-classification task, and meet the criteria for real-time detection.

22.Vehicle Re-identification Method Based on Vehicle Attribute and Mutual Exclusion Between Cameras ⬇️

Vehicle re-identification aims to identify a specific vehicle across time and camera views. With the rapid growth of intelligent transportation systems and smart cities, vehicle re-identification technology has attracted increasing attention. However, due to differences in shooting angle and the high similarity of vehicles belonging to the same brand, vehicle re-identification remains a great challenge for existing methods. In this paper, we propose a vehicle attribute-guided method to re-rank vehicle Re-ID results. The attributes used include vehicle orientation and vehicle brand. We also focus on camera information and introduce a camera mutual exclusion theory to further fine-tune the search results. In terms of feature extraction, we combine multi-resolution data augmentation with a large model ensemble to obtain more robust vehicle features. Our method achieves 63.73% mAP and 76.61% rank-1 accuracy in the CVPR 2021 AI City Challenge.

23.Action in Mind: A Neural Network Approach to Action Recognition and Segmentation ⬇️

Recognizing and categorizing human actions is an important task with applications in various fields such as human-robot interaction, video analysis, surveillance, video retrieval, health care systems and the entertainment industry. This thesis presents a novel computational approach for human action recognition through different implementations of multi-layer architectures based on artificial neural networks. Each system-level development is designed to solve different aspects of the action recognition problem, including online real-time processing, action segmentation and the involvement of objects. The analysis of the experimental results is illustrated and described in six articles. The proposed action recognition architecture of this thesis is composed of several processing layers, including a preprocessing layer, an ordered vector representation layer and three layers of neural networks. It utilizes self-organizing neural networks such as Kohonen feature maps and growing grids as the main neural network layers. Thus the architecture presents a biologically plausible approach with certain features such as topographic organization of the neurons, lateral interactions, semi-supervised learning and the ability to represent high-dimensional input space in lower-dimensional maps. For each level of development the system is trained with input data consisting of consecutive 3D body postures and tested with generalized input data that the system has never met before. The experimental results of the different system-level developments show that the system performs well, with quite high accuracy in recognizing human actions.

24.NTIRE 2021 Challenge on Image Deblurring ⬇️

Motion blur is a common photography artifact in dynamic environments that typically comes jointly with other types of degradation. This paper reviews the NTIRE 2021 Challenge on Image Deblurring. In this challenge report, we describe the challenge specifics and the evaluation results from the 2 competition tracks with the proposed solutions. While both tracks aim to recover a high-quality clean image from a blurry image, different artifacts are jointly involved. In track 1, the blurry images are in a low resolution, while track 2 images are compressed in JPEG format. The two tracks had 338 and 238 registered participants, respectively, and 18 and 17 teams competed in the final testing phase. The winning methods demonstrate state-of-the-art performance on the image deblurring task with the jointly combined artifacts.

25.NTIRE 2021 Challenge on Video Super-Resolution ⬇️

Super-Resolution (SR) is a fundamental computer vision task that aims to obtain a high-resolution clean image from the given low-resolution counterpart. This paper reviews the NTIRE 2021 Challenge on Video Super-Resolution. We present evaluation results from two competition tracks as well as the proposed solutions. Track 1 aims to develop conventional video SR methods focusing on restoration quality. Track 2 assumes a more challenging environment with lower frame rates, casting a spatio-temporal SR problem. The two tracks had 247 and 223 registered participants, respectively. During the final testing phase, 14 teams competed in each track to achieve state-of-the-art performance on video SR tasks.

26.RobustFusion: Robust Volumetric Performance Reconstruction under Human-object Interactions from Monocular RGBD Stream ⬇️

High-quality 4D reconstruction of human performance with complex interactions to various objects is essential in real-world scenarios, which enables numerous immersive VR/AR applications. However, recent advances still fail to provide reliable performance reconstruction, suffering from challenging interaction patterns and severe occlusions, especially for the monocular setting. To fill this gap, in this paper, we propose RobustFusion, a robust volumetric performance reconstruction system for human-object interaction scenarios using only a single RGBD sensor, which combines various data-driven visual and interaction cues to handle the complex interaction patterns and severe occlusions. We propose a semantic-aware scene decoupling scheme to model the occlusions explicitly, with a segmentation refinement and robust object tracking to prevent disentanglement uncertainty and maintain temporal consistency. We further introduce a robust performance capture scheme with the aid of various data-driven cues, which not only enables re-initialization ability, but also models the complex human-object interaction patterns in a data-driven manner. To this end, we introduce a spatial relation prior to prevent implausible intersections, as well as data-driven interaction cues to maintain natural motions, especially for those regions under severe human-object occlusions. We also adopt an adaptive fusion scheme for temporally coherent human-object reconstruction with occlusion analysis and human parsing cue. Extensive experiments demonstrate the effectiveness of our approach to achieve high-quality 4D human performance reconstruction under complex human-object interactions whilst still maintaining the lightweight monocular setting.

27.Multi Voxel-Point Neurons Convolution (MVPConv) for Fast and Accurate 3D Deep Learning ⬇️

We present a new convolutional neural network, called Multi Voxel-Point Neurons Convolution (MVPConv), for fast and accurate 3D deep learning. Previous works adopt either individual point-based features or local-neighboring voxel-based features to process 3D models, which limits the performance of models due to inefficient computation. Moreover, most existing 3D deep learning frameworks aim at solving one specific task, and only a few of them can handle a variety of tasks. Integrating the advantages of both voxel- and point-based methods, the proposed MVPConv can effectively increase the neighboring collection between point-based features and also promote the independence among voxel-based features. Simply replacing the corresponding convolution module with MVPConv, we show that MVPConv can fit in different backbones to solve a wide range of 3D tasks. Extensive experiments on benchmark datasets such as ShapeNet Part, S3DIS and KITTI for various tasks show that MVPConv improves the accuracy of the backbone (PointNet) by up to 36%, and achieves higher accuracy than the voxel-based model with up to 34 times speedup. In addition, MVPConv also outperforms the state-of-the-art point-based models with up to 8 times speedup. Notably, our MVPConv achieves better accuracy than the newest point-voxel-based model PVCNN (a model more efficient than PointNet) with lower latency.

28.SegmentMeIfYouCan: A Benchmark for Anomaly Segmentation ⬇️

State-of-the-art semantic or instance segmentation deep neural networks (DNNs) are usually trained on a closed set of semantic classes. As such, they are ill-equipped to handle previously-unseen objects. However, detecting and localizing such objects is crucial for safety-critical applications such as perception for automated driving, especially if they appear on the road ahead. While some methods have tackled the tasks of anomalous or out-of-distribution object segmentation, progress remains slow, in large part due to the lack of solid benchmarks; existing datasets either consist of synthetic data, or suffer from label inconsistencies. In this paper, we bridge this gap by introducing the "SegmentMeIfYouCan" benchmark. Our benchmark addresses two tasks: anomalous object segmentation, which considers any previously-unseen object category; and road obstacle segmentation, which focuses on any object on the road, be it known or unknown. We provide two corresponding datasets together with a test suite performing an in-depth method analysis, considering both established pixel-wise performance metrics and recent component-wise ones, which are insensitive to object sizes. We empirically evaluate multiple state-of-the-art baseline methods, including several specifically designed for anomaly / obstacle segmentation, on our datasets as well as on public ones, using our benchmark suite. The anomaly and obstacle segmentation results show that our datasets contribute to the diversity and difficulty of both dataset landscapes.

29.GODIVA: Generating Open-DomaIn Videos from nAtural Descriptions ⬇️

Generating videos from text is a challenging task due to its high computational requirements for training and infinite possible answers for evaluation. Existing works typically experiment on simple or small datasets, where the generalization ability is quite limited. In this work, we propose GODIVA, an open-domain text-to-video pretrained model that can generate videos from text in an auto-regressive manner using a three-dimensional sparse attention mechanism. We pretrain our model on Howto100M, a large-scale text-video dataset that contains more than 136 million text-video pairs. Experiments show that GODIVA not only can be fine-tuned on downstream video generation tasks, but also has a good zero-shot capability on unseen texts. We also propose a new metric called Relative Matching (RM) to automatically evaluate the video generation quality. Several challenges are listed and discussed as future work.

30.Few-Shot Video Object Detection ⬇️

We introduce Few-Shot Video Object Detection (FSVOD) with three important contributions: 1) a large-scale video dataset FSVOD-500 comprising 500 classes with class-balanced videos in each category for few-shot learning; 2) a novel Tube Proposal Network (TPN) to generate high-quality video tube proposals to aggregate feature representations for the target video object; 3) a strategically improved Temporal Matching Network (TMN+) to match representative query tube features and supports with better discriminative ability. Our TPN and TMN+ are jointly and end-to-end trained. Extensive experiments demonstrate that our method produces significantly better detection results on two few-shot video object detection datasets compared to image-based methods and other naive video-based extensions. Codes and datasets will be released at this https URL.

31.Editable Free-viewpoint Video Using a Layered Neural Representation ⬇️

Generating free-viewpoint videos is critical for immersive VR/AR experiences, but recent neural advances still lack the editing ability to manipulate the visual perception of large dynamic scenes. To fill this gap, in this paper we propose the first approach for editable, photo-realistic free-viewpoint video generation for large-scale dynamic scenes using only 16 sparse cameras. The core of our approach is a new layered neural representation, where each dynamic entity, including the environment itself, is formulated into a space-time coherent neural layered radiance representation called ST-NeRF. Such a layered representation supports full perception and realistic manipulation of the dynamic scene whilst still supporting a free viewing experience over a wide range. In our ST-NeRF, the dynamic entity/layer is represented as continuous functions, which achieves the disentanglement of location, deformation, and appearance of the dynamic entity in a continuous and self-supervised manner. We propose a scene-parsing 4D label map tracking to disentangle the spatial information explicitly, and a continuous deform module to disentangle the temporal motion implicitly. An object-aware volume rendering scheme is further introduced for the re-assembling of all the neural layers. We adopt a novel layered loss and motion-aware ray sampling strategy to enable efficient training for a large dynamic scene with multiple performers. Our framework further enables a variety of editing functions, i.e., manipulating the scale and location, or duplicating or retiming individual neural layers to create numerous visual effects while preserving high realism. Extensive experiments demonstrate the effectiveness of our approach to achieve high-quality, photo-realistic, and editable free-viewpoint video generation for dynamic scenes.

32.BiCnet-TKS: Learning Efficient Spatial-Temporal Representation for Video Person Re-Identification ⬇️

In this paper, we present an efficient spatial-temporal representation for video person re-identification (reID). Firstly, we propose a Bilateral Complementary Network (BiCnet) for spatial complementarity modeling. Specifically, BiCnet contains two branches. The Detail Branch processes frames at the original resolution to preserve detailed visual clues, and the Context Branch with a down-sampling strategy is employed to capture long-range contexts. On each branch, BiCnet appends multiple parallel and diverse attention modules to discover divergent body parts for consecutive frames, so as to obtain an integral characteristic of the target identity. Furthermore, a Temporal Kernel Selection (TKS) block is designed to capture short-term as well as long-term temporal relations in an adaptive way. TKS can be inserted into BiCnet at any depth to construct BiCnet-TKS for spatial-temporal modeling. Experimental results on multiple benchmarks show that BiCnet-TKS outperforms state-of-the-arts with about 50% less computation. The source code is available at this https URL blue-blue272/BiCnet-TKS.

33.Cleaning Label Noise with Clusters for Minimally Supervised Anomaly Detection ⬇️

Learning to detect real-world anomalous events using video-level annotations is a difficult task, mainly because of the noise present in labels. A video labelled as anomalous may actually contain an anomaly for only a short duration while the rest of the video is normal. In the current work, we formulate a weakly supervised anomaly detection method that is trained using only video-level labels. To this end, we propose to utilize binary clustering, which helps in mitigating the noise present in the labels of anomalous videos. Our formulation encourages both the main network and the clustering to complement each other in achieving the goal of weakly supervised training. The proposed method yields 78.27% and 84.16% frame-level AUC on the UCF-Crime and ShanghaiTech datasets respectively, demonstrating its superiority over existing state-of-the-art algorithms.

34.PointLIE: Locally Invertible Embedding for Point Cloud Sampling and Recovery ⬇️

Point Cloud Sampling and Recovery (PCSR) is critical for massive real-time point cloud collection and processing, since raw data usually requires large storage and computation. In this paper, we address a fundamental problem in PCSR: how to downsample a dense point cloud at arbitrary scales while preserving the local topology of the discarded points in a case-agnostic manner (i.e. without additional storage for point relationships)? We propose a novel Locally Invertible Embedding for point cloud adaptive sampling and recovery (PointLIE). Instead of learning to predict the underlying geometry details in a seemingly plausible manner, PointLIE unifies point cloud sampling and upsampling in one single framework through bi-directional learning. Specifically, PointLIE recursively samples and adjusts neighboring points on each scale. Then it encodes the neighboring offsets of sampled points to a latent space and thus decouples the sampled points from the corresponding local geometric relationships. Once the latent space is determined and the deep model is optimized, the recovery process can be conducted by passing the sampled points and a randomly-drawn embedding through the same network via an invertible operation. Such a scheme guarantees the fidelity of dense point recovery from sampled points. Extensive experiments demonstrate that the proposed PointLIE outperforms state-of-the-arts both quantitatively and qualitatively. Our code is released through this https URL.

35.TREND: Truncated Generalized Normal Density Estimation of Inception Embeddings for Accurate GAN Evaluation ⬇️

Evaluating image generation models such as generative adversarial networks (GANs) is a challenging problem. A common approach is to compare the distributions of the set of ground truth images and the set of generated test images. The Fréchet Inception distance is one of the most widely used metrics for evaluation of GANs, which assumes that the features from a trained Inception model for a set of images follow a normal distribution. In this paper, we argue that this is an over-simplified assumption, which may lead to unreliable evaluation results, and more accurate density estimation can be achieved using a truncated generalized normal distribution. Based on this, we propose a novel metric for accurate evaluation of GANs, named TREND (TRuncated gEneralized Normal Density estimation of inception embeddings). We demonstrate that our approach significantly reduces errors of density estimation, which consequently eliminates the risk of faulty evaluation results. Furthermore, we show that the proposed metric significantly improves robustness of evaluation results against variation of the number of image samples.
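
For reference, the Gaussian-assumption baseline the paper critiques, the standard Fréchet Inception Distance, can be computed as follows. TREND's truncated generalized normal density estimation is the paper's contribution and is not reproduced in this sketch.

```python
import numpy as np
from scipy import linalg

def fid(feats_real: np.ndarray, feats_fake: np.ndarray) -> float:
    """Standard FID between two sets of Inception features (rows are samples)."""
    mu1, mu2 = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    cov1 = np.cov(feats_real, rowvar=False)
    cov2 = np.cov(feats_fake, rowvar=False)
    covmean = linalg.sqrtm(cov1 @ cov2)
    if np.iscomplexobj(covmean):
        covmean = covmean.real              # drop tiny imaginary parts from numerical error
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(cov1 + cov2 - 2.0 * covmean))
```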

36.CoCon: Cooperative-Contrastive Learning ⬇️

Labeling videos at scale is impractical. Consequently, self-supervised visual representation learning is key for efficient video analysis. Recent success in learning image representations suggests contrastive learning is a promising framework to tackle this challenge. However, when applied to real-world videos, contrastive learning may unknowingly lead to the separation of instances that contain semantically similar events. In our work, we introduce a cooperative variant of contrastive learning to utilize complementary information across views and address this issue. We use data-driven sampling to leverage implicit relationships between multiple input video views, whether observed (e.g. RGB) or inferred (e.g. flow, segmentation masks, poses). We are among the first to exploit inter-instance relationships to drive learning. We experimentally evaluate our representations on the downstream task of action recognition. Our method achieves competitive performance on standard benchmarks (UCF101, HMDB51, Kinetics400). Furthermore, qualitative experiments illustrate that our models can capture higher-order class relationships.

37.GM-MLIC: Graph Matching based Multi-Label Image Classification ⬇️

Multi-Label Image Classification (MLIC) aims to predict a set of labels that are present in an image. The key to this problem is to mine the associations between image contents and labels, and further obtain the correct assignments between images and their labels. In this paper, we treat each image as a bag of instances and reformulate the task of MLIC as an instance-label matching selection problem. To model this problem, we propose a novel deep learning framework named Graph Matching based Multi-Label Image Classification (GM-MLIC), where a Graph Matching (GM) scheme is introduced owing to its excellent capability of excavating instance-label relationships. Specifically, we first construct an instance spatial graph and a label semantic graph respectively, and then incorporate them into a constructed assignment graph by connecting each instance to all labels. Subsequently, a graph network block is adopted to aggregate and update the states of all nodes and edges on the assignment graph to form structured representations for each instance and label. Our network finally derives a prediction score for each instance-label correspondence and optimizes such correspondences with a weighted cross-entropy loss. Extensive experiments conducted on various image datasets demonstrate the superiority of our proposed method.

38.StyleMapGAN: Exploiting Spatial Dimensions of Latent in GAN for Real-time Image Editing ⬇️

Generative adversarial networks (GANs) synthesize realistic images from random latent vectors. Although manipulating the latent vectors controls the synthesized outputs, editing real images with GANs suffers from i) time-consuming optimization for projecting real images to the latent vectors, ii) or inaccurate embedding through an encoder. We propose StyleMapGAN: the intermediate latent space has spatial dimensions, and a spatially variant modulation replaces AdaIN. It makes the embedding through an encoder more accurate than existing optimization-based methods while maintaining the properties of GANs. Experimental results demonstrate that our method significantly outperforms state-of-the-art models in various image manipulation tasks such as local editing and image interpolation. Last but not least, conventional editing methods on GANs are still valid on our StyleMapGAN. Source code is available at this https URL.

39.Studying the Consistency and Composability of Lottery Ticket Pruning Masks ⬇️

Magnitude pruning is a common, effective technique to identify sparse subnetworks at little cost to accuracy. In this work, we ask whether a particular architecture's accuracy-sparsity tradeoff can be improved by combining pruning information across multiple runs of training. From a shared ResNet-20 initialization, we train several network copies (\emph{siblings}) to completion using different SGD data orders on CIFAR-10. While the siblings' pruning masks are naively not much more similar than chance, starting sibling training after a few epochs of shared pretraining significantly increases pruning overlap. We then choose a subnetwork by either (1) taking all weights that survive pruning in any sibling (mask union), or (2) taking only the weights that survive pruning across all siblings (mask intersection). The resulting subnetwork is retrained. Strikingly, we find that union and intersection masks perform very similarly. Both methods match the accuracy-sparsity tradeoffs of the one-shot magnitude pruning baseline, even when we combine masks from up to $k = 10$ siblings.
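
The mask-combination step is simple to state in code. The sketch below assumes `sibling_weights` is a hypothetical list of same-shaped weight tensors from siblings trained with different data orders, and shows only per-tensor magnitude masks rather than the paper's full global pruning and retraining pipeline.

```python
import torch

def magnitude_mask(weights: torch.Tensor, sparsity: float = 0.8) -> torch.Tensor:
    """Boolean mask keeping the largest-magnitude weights (sparsity = fraction pruned)."""
    k = max(1, int(weights.numel() * sparsity))
    threshold = weights.abs().flatten().kthvalue(k).values
    return weights.abs() > threshold

masks = [magnitude_mask(w) for w in sibling_weights]          # one mask per sibling

union_mask = torch.stack(masks).any(dim=0)                    # weights surviving in ANY sibling
intersection_mask = torch.stack(masks).all(dim=0)             # weights surviving in ALL siblings
```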

40.Reproducibility of "FDA: Fourier Domain Adaptation for Semantic Segmentation" ⬇️

This paper is a reproducibility report for "FDA: Fourier Domain Adaptation for Semantic Segmentation", published at CVPR 2020, produced as part of the ML Reproducibility Challenge 2020. The original code was made available by the authors. A well-commented version of the code, derived from the original and containing all ablation studies performed along with WANDB integration, is available at <this http URL> with instructions for executing the experiments in the README.

41.Center Prediction Loss for Re-identification ⬇️

The training loss function that enforces certain training sample distribution patterns plays a critical role in building a re-identification (ReID) system. Besides the basic requirement of discrimination, i.e., the features corresponding to different identities should not be mixed, additional intra-class distribution constraints, such as features from the same identities should be close to their centers, have been adopted to construct losses. Despite the advances of various new loss functions, it is still challenging to strike the balance between the need of reducing the intra-class variation and allowing certain distribution freedom. In this paper, we propose a new loss based on center predictivity, that is, a sample must be positioned in a location of the feature space such that from it we can roughly predict the location of the center of same-class samples. The prediction error is then regarded as a loss called Center Prediction Loss (CPL). We show that, without introducing additional hyper-parameters, this new loss leads to a more flexible intra-class distribution constraint while ensuring the between-class samples are well-separated. Extensive experiments on various real-world ReID datasets show that the proposed loss can achieve superior performance and can also be complementary to existing losses.
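
One plausible reading of the loss described above, offered only as an illustrative sketch: a small hypothetical `center_head` predicts, from each sample's feature, the center of its identity, and the prediction error against the empirical class mean is penalised. The paper's exact formulation may differ.

```python
import torch
import torch.nn.functional as F

def center_prediction_loss(feats: torch.Tensor, labels: torch.Tensor, center_head) -> torch.Tensor:
    """Penalise the gap between predicted class centers and empirical batch centers."""
    unique_labels = labels.unique()
    loss = feats.new_zeros(())
    for label in unique_labels:
        idx = labels == label
        class_center = feats[idx].mean(dim=0).detach()   # empirical center of this identity
        predicted = center_head(feats[idx])              # per-sample center predictions
        loss = loss + F.mse_loss(predicted, class_center.expand_as(predicted))
    return loss / len(unique_labels)
```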

42.Chop Chop BERT: Visual Question Answering by Chopping VisualBERT's Heads ⬇️

Vision-and-Language (VL) pre-training has shown great potential on many related downstream tasks, such as Visual Question Answering (VQA), one of the most popular problems in the VL field. All of these pre-trained models (such as VisualBERT, ViLBERT, LXMERT and UNITER) are built with Transformers, which extend the classical attention mechanism to multiple layers and heads. To investigate why and how these models work so well on VQA, in this paper we explore the roles of individual heads and layers in Transformer models when handling $12$ different types of questions. Specifically, we manually remove (chop) heads (or layers) from a pre-trained VisualBERT model one at a time, and test it on different levels of questions to record its performance. As shown by the interesting echelon shape of the result matrices, the experiments reveal that different heads and layers are responsible for different question types, with higher-level layers activated by higher-level visual reasoning questions. Based on this observation, we design a dynamic chopping module that can automatically remove heads and layers of the VisualBERT at an instance level when dealing with different questions. Our dynamic chopping module can effectively reduce the parameters of the original model by 50%, while reducing accuracy by less than 1% on the VQA task.
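
"Chopping" a head at inference time amounts to zeroing that head's contribution before the per-head outputs are combined; the generic sketch below illustrates this on a tensor of per-head attention outputs and is not VisualBERT's actual implementation or the paper's dynamic chopping module.

```python
import torch

def mask_heads(per_head_output: torch.Tensor, heads_to_drop: list) -> torch.Tensor:
    """Zero out selected attention heads.

    per_head_output: (batch, num_heads, seq_len, head_dim) tensor from one attention layer.
    """
    out = per_head_output.clone()
    out[:, heads_to_drop] = 0.0     # the chopped heads contribute nothing downstream
    return out
```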

43.DPR-CAE: Capsule Autoencoder with Dynamic Part Representation for Image Parsing ⬇️

Parsing an image into a hierarchy of objects, parts, and relations is important and also challenging in many computer vision tasks. This paper proposes a simple and effective capsule autoencoder to address this issue, called DPR-CAE. In our approach, the encoder parses the input into a set of part capsules, including pose, intensity, and a dynamic vector. The decoder introduces a novel dynamic part representation (DPR) by combining the dynamic vector and a shared template bank. These part representations are then regulated by corresponding capsules to composite the final output in an interpretable way. Besides, an extra translation-invariant module is proposed to avoid directly learning the uncertain scene-part relationship in our DPR-CAE, which lets the resulting method achieve a promising performance gain on $rm$-MNIST and $rm$-Fashion-MNIST. DPR-CAE can be easily combined with the existing stacked capsule autoencoder, and experimental results show it significantly improves performance in terms of unsupervised object classification. Our code is available in the Appendix.

44.Perceptual Image Quality Assessment with Transformers ⬇️

In this paper, we propose an image quality transformer (IQT) that successfully applies a transformer architecture to the perceptual full-reference image quality assessment (IQA) task. Perceptual representation is becoming more important in image quality assessment. In this context, we extract perceptual feature representations from each of the input images using a convolutional neural network (CNN) backbone. The extracted feature maps are fed into the transformer encoder and decoder in order to compare the reference and distorted images. Following the approach of transformer-based vision models, we use an extra learnable quality embedding and position embedding. The output of the transformer is passed to a prediction head in order to predict a final quality score. The experimental results show that our proposed model has outstanding performance on the standard IQA datasets. For a large-scale IQA dataset containing output images of generative models, our model also shows promising results. The proposed IQT was ranked first among 13 participants in the NTIRE 2021 perceptual image quality assessment challenge. Our work will be an opportunity to further expand the approach for the perceptual IQA task.

45.CoSformer: Detecting Co-Salient Object with Transformers ⬇️

Co-Salient Object Detection (CoSOD) aims at simulating the human visual system to discover common and salient objects from a group of relevant images. Recent methods typically develop sophisticated deep-learning-based models that have greatly improved the performance of the CoSOD task. However, two major drawbacks still need to be addressed: 1) sub-optimal inter-image relationship modeling; 2) lack of consideration of inter-image separability. In this paper, we propose the Co-Salient Object Detection Transformer (CoSformer) network to capture both salient and common visual patterns from multiple images. By leveraging the Transformer architecture, the proposed method addresses the influence of input order and greatly improves the stability of the CoSOD task. We also introduce a novel concept of inter-image separability. We construct a contrastive learning scheme to model inter-image separability and learn a more discriminative embedding space to distinguish true common objects from noisy objects. Extensive experiments on three challenging benchmarks, i.e., CoCA, CoSOD3k, and Cosal2015, demonstrate that our CoSformer outperforms cutting-edge models and achieves a new state-of-the-art. We hope that CoSformer can motivate future research on more visual co-analysis tasks.

46.MOOD: Multi-level Out-of-distribution Detection ⬇️

Out-of-distribution (OOD) detection is essential to prevent anomalous inputs from causing a model to fail during deployment. While improved OOD detection methods have emerged, they often rely on the final layer outputs and require a full feedforward pass for any given input. In this paper, we propose a novel framework, Multi-level Out-Of-Distribution detection (MOOD), which exploits intermediate classifier outputs for dynamic and efficient OOD inference. We explore and establish a direct relationship between the OOD data complexity and the optimal exit level, and show that easy OOD examples can be effectively detected early without propagating to deeper layers. At each exit, the OOD examples can be distinguished through our proposed adjusted energy score, which is both empirically and theoretically suitable for networks with multiple classifiers. We extensively evaluate MOOD across 10 OOD datasets spanning a wide range of complexities. Experiments demonstrate that MOOD achieves up to 71.05% computational reduction in inference, while maintaining competitive OOD detection performance.
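
The sketch below illustrates the general idea of early-exit OOD scoring with an energy score: a cheap complexity proxy routes an input to an exit, and the free energy of that exit's logits is thresholded. The routing rule and the complexity proxy shown here are simplified assumptions and not the paper's exact adjusted energy score.

```python
import zlib
import numpy as np
import torch

def energy_score(logits, temperature=1.0):
    # Standard free-energy score: higher (less negative) values indicate in-distribution data.
    return temperature * torch.logsumexp(logits / temperature, dim=-1)

def pick_exit(x_uint8, num_exits):
    # Assumption: route simpler inputs (smaller compressed size) to earlier exits.
    complexity = len(zlib.compress(x_uint8.tobytes()))
    return int(np.clip(complexity // 2000, 0, num_exits - 1))

def mood_like_inference(x_uint8, exit_classifiers, threshold=-5.0):
    level = pick_exit(x_uint8, len(exit_classifiers))
    logits = exit_classifiers[level](torch.from_numpy(x_uint8).float().unsqueeze(0).flatten(1))
    score = energy_score(logits)
    return level, score.item(), bool(score.item() < threshold)   # exit used, score, OOD flag

# Toy usage with linear "exits" over a flattened 32x32x3 image.
exits = [torch.nn.Linear(32 * 32 * 3, 10) for _ in range(3)]
x = (np.random.rand(32, 32, 3) * 255).astype(np.uint8)
print(mood_like_inference(x, exits))
```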

47.End-to-End Attention-based Image Captioning ⬇️

In this paper, we address the problem of image captioning specifically for molecular translation, where the result is a predicted chemical notation in InChI format for a given molecular structure. Current approaches mainly follow rule-based or CNN+RNN methodologies. However, they seem to underperform on noisy images and images with a small number of distinguishable features. To overcome this, we propose an end-to-end transformer model. Compared to attention-based techniques, our proposed model performs better on molecular datasets.

48.Pyramid Medical Transformer for Medical Image Segmentation ⬇️

Deep neural networks have been a prevailing technique in the field of medical image processing. However, the most popular convolutional neural networks (CNNs) based methods for medical image segmentation are imperfect because they cannot adequately model long-range pixel relations. Transformers and the self-attention mechanism are recently proposed to effectively learn long-range dependencies by modeling all pairs of word-to-word attention regardless of their positions. The idea has also been extended to the computer vision field by creating and treating image patches as embeddings. Considering the computation complexity for whole image self-attention, current transformer-based models settle for a rigid partitioning scheme that would potentially lose informative relations. Besides, current medical transformers model global context on full resolution images, leading to unnecessary computation costs. To address these issues, we developed a novel method to integrate multi-scale attention and CNN feature extraction using a pyramidal network architecture, namely Pyramid Medical Transformer (PMTrans). The PMTrans captured multi-range relations by working on multi-resolution images. An adaptive partitioning scheme was implemented to retain informative relations and to access different receptive fields efficiently. Experimental results on two medical image datasets, gland segmentation and MoNuSeg datasets, showed that PMTrans outperformed the latest CNN-based and transformer-based models for medical image segmentation.

49.Spirit Distillation: A Model Compression Method with Multi-domain Knowledge Transfer ⬇️

Recent applications impose requirements of both cross-domain knowledge transfer and model compression on machine learning models due to insufficient training data and limited computational resources. In this paper, we propose a new knowledge distillation model, named Spirit Distillation (SD), which is a model compression method with multi-domain knowledge transfer. The compact student network mimics a representation equivalent to the front part of the teacher network, through which general knowledge can be transferred from the source domain (teacher) to the target domain (student). To further improve the robustness of the student, we extend SD to Enhanced Spirit Distillation (ESD), which exploits more comprehensive knowledge by introducing a proximity domain, similar to the target domain, for feature extraction. Results demonstrate that our method can boost mIOU and high-precision accuracy by 1.4% and 8.2% respectively with 78.2% segmentation variance, and can obtain a precise compact network with only 41.8% of the FLOPs.
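
A hedged reading of the front-part mimicking objective is sketched below: the student's early layers, passed through a 1x1 adapter, regress the frozen teacher's early-layer features with an MSE loss. The layer cut points, the adapter, and the backbones are illustrative assumptions, not the paper's exact setup.

```python
import torch
import torch.nn as nn
import torchvision.models as tvm

teacher = tvm.resnet50(weights=None)
student = tvm.resnet18(weights=None)

# "Front parts": everything up to (and including) layer2 of each ResNet.
teacher_front = nn.Sequential(*list(teacher.children())[:6]).eval()   # 512-channel features
student_front = nn.Sequential(*list(student.children())[:6])          # 128-channel features
adapter = nn.Conv2d(128, 512, kernel_size=1)                          # match channel widths

def front_mimic_loss(x):
    with torch.no_grad():
        t_feat = teacher_front(x)              # frozen teacher representation
    s_feat = adapter(student_front(x))         # student mimics the teacher's front features
    return nn.functional.mse_loss(s_feat, t_feat)

loss = front_mimic_loss(torch.rand(2, 3, 224, 224))
loss.backward()
```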

50.Analysis of Manual and Automated Skin Tone Assignments for Face Recognition Applications ⬇️

News reports have suggested that darker skin tone causes an increase in face recognition errors. The Fitzpatrick scale is widely used in dermatology to classify sensitivity to sun exposure and skin tone. In this paper, we analyze a set of manual Fitzpatrick skin type assignments and also employ the individual typology angle to automatically estimate the skin tone from face images. The set of manual skin tone rating experiments shows that there are inconsistencies between human raters that are difficult to eliminate. Efforts to automate skin tone rating suggest that it is particularly challenging on images collected without a calibration object in the scene. However, after the color-correction, the level of agreement between automated and manual approaches is found to be 96% or better for the MORPH images. To our knowledge, this is the first work to: (a) examine the consistency of manual skin tone ratings across observers, (b) document that there is substantial variation in the rating of the same image by different observers even when exemplar images are given for guidance and all images are color-corrected, and (c) compare manual versus automated skin tone ratings.
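
For reference, the individual typology angle is computed in CIELAB as ITA = arctan((L* - 50)/b*) expressed in degrees. The sketch below applies this to an sRGB skin patch; the mapping from ITA to discrete skin-tone categories varies across the literature and is not reproduced here.

```python
import numpy as np
from skimage import color

def individual_typology_angle(rgb_patch):
    """rgb_patch: float array in [0, 1], shape (H, W, 3), ideally color-corrected skin pixels."""
    lab = color.rgb2lab(rgb_patch)
    L_star = lab[..., 0].mean()                 # average lightness of the patch
    b_star = lab[..., 2].mean()                 # average yellow-blue component
    return np.degrees(np.arctan2(L_star - 50.0, b_star))

patch = np.ones((16, 16, 3)) * np.array([0.82, 0.64, 0.52])   # a toy skin-like color
print(f"ITA = {individual_typology_angle(patch):.1f} degrees")
```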

51.EagerMOT: 3D Multi-Object Tracking via Sensor Fusion ⬇️

Multi-object tracking (MOT) enables mobile robots to perform well-informed motion planning and navigation by localizing surrounding objects in 3D space and time. Existing methods rely on depth sensors (e.g., LiDAR) to detect and track targets in 3D space, but only up to a limited sensing range due to the sparsity of the signal. On the other hand, cameras provide a dense and rich visual signal that helps to localize even distant objects, but only in the image domain. In this paper, we propose EagerMOT, a simple tracking formulation that eagerly integrates all available object observations from both sensor modalities to obtain a well-informed interpretation of the scene dynamics. Using images, we can identify distant incoming objects, while depth estimates allow for precise trajectory localization as soon as objects are within the depth-sensing range. With EagerMOT, we achieve state-of-the-art results across several MOT tasks on the KITTI and NuScenes datasets. Our code is available at this https URL.

52.AGORA: Avatars in Geography Optimized for Regression Analysis ⬇️

While the accuracy of 3D human pose estimation from images has steadily improved on benchmark datasets, the best methods still fail in many real-world scenarios. This suggests that there is a domain gap between current datasets and common scenes containing people. To obtain ground-truth 3D pose, current datasets limit the complexity of clothing, environmental conditions, number of subjects, and occlusion. Moreover, current datasets evaluate sparse 3D joint locations corresponding to the major joints of the body, ignoring the hand pose and the face shape. To evaluate the current state-of-the-art methods on more challenging images, and to drive the field to address new problems, we introduce AGORA, a synthetic dataset with high realism and highly accurate ground truth. Here we use 4240 commercially-available, high-quality, textured human scans in diverse poses and natural clothing; this includes 257 scans of children. We create reference 3D poses and body shapes by fitting the SMPL-X body model (with face and hands) to the 3D scans, taking into account clothing. We create around 14K training and 3K test images by rendering between 5 and 15 people per image using either image-based lighting or rendered 3D environments, taking care to make the images physically plausible and photoreal. In total, AGORA consists of 173K individual person crops. We evaluate existing state-of-the-art methods for 3D human pose estimation on this dataset and find that most methods perform poorly on images of children. Hence, we extend the SMPL-X model to better capture the shape of children. Additionally, we fine-tune methods on AGORA and show improved performance on both AGORA and 3DPW, confirming the realism of the dataset. We provide all the registered 3D reference training data, rendered images, and a web-based evaluation site at this https URL.

53.HandsFormer: Keypoint Transformer for Monocular 3D Pose Estimation of Hands and Object in Interaction ⬇️

We propose a robust and accurate method for estimating the 3D poses of two hands in close interaction from a single color image. This is a very challenging problem, as large occlusions and many confusions between the joints may happen. Our method starts by extracting a set of potential 2D locations for the joints of both hands as extrema of a heatmap. We do not require that all locations correctly correspond to a joint, nor that all the joints are detected. We use appearance and spatial encodings of these locations as input to a transformer, and leverage the attention mechanisms to sort out the correct configuration of the joints and output the 3D poses of both hands. Our approach thus combines the recognition power of a Transformer with the accuracy of heatmap-based methods. We also show it can be extended to estimate the 3D pose of an object manipulated by one or two hands. We evaluate our approach on the recent and challenging InterHand2.6M and HO-3D datasets. We obtain 17% improvement over the baseline. Moreover, we introduce the first dataset of action sequences of two hands manipulating an object that is fully annotated in 3D, and we will make it publicly available.

54.Text2Video: Text-driven Talking-head Video Synthesis with Phonetic Dictionary ⬇️

With the advance of deep learning technology, automatic video generation from audio or text has become an emerging and promising research topic. In this paper, we present a novel approach to synthesize video from the text. The method builds a phoneme-pose dictionary and trains a generative adversarial network (GAN) to generate video from interpolated phoneme poses. Compared to audio-driven video generation algorithms, our approach has a number of advantages: 1) It only needs a fraction of the training data used by an audio-driven approach; 2) It is more flexible and not subject to vulnerability due to speaker variation; 3) It significantly reduces the preprocessing, training and inference time. We perform extensive experiments to compare the proposed method with state-of-the-art talking face generation methods on a benchmark dataset and datasets of our own. The results demonstrate the effectiveness and superiority of our approach.

55.Scalable Semi-supervised Landmark Localization for X-ray Images using Few-shot Deep Adaptive Graph ⬇️

Landmark localization plays an important role in medical image analysis. Learning-based methods, including CNNs and GCNs, have demonstrated state-of-the-art performance. However, most of these methods are fully supervised and heavily rely on manual labeling of a large training dataset. In this paper, based on a fully supervised graph-based method, DAG, we propose a semi-supervised extension of it, termed few-shot DAG, i.e., five-shot DAG. It first trains a DAG model on the labeled data and then fine-tunes the pre-trained model on the unlabeled data with a teacher-student SSL mechanism. In addition to the semi-supervised loss, we propose another loss using the JS divergence to regulate the consistency of the intermediate feature maps. We extensively evaluated our method on pelvis, hand and chest landmark detection tasks. Our experimental results demonstrate consistent and significant improvements over previous methods.
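
A compact sketch of a Jensen-Shannon consistency term between teacher and student intermediate feature maps is shown below; how the maps are turned into distributions (here, a softmax over spatial locations per channel) is an assumption, not necessarily the paper's choice.

```python
import torch
import torch.nn.functional as F

def js_consistency(student_feat, teacher_feat, eps=1e-8):
    """Both tensors: (B, C, H, W). Treat each channel's spatial map as a distribution."""
    p = F.softmax(student_feat.flatten(2), dim=-1)
    q = F.softmax(teacher_feat.flatten(2), dim=-1)
    m = 0.5 * (p + q)                                              # mixture distribution
    kl_pm = (p * (torch.log(p + eps) - torch.log(m + eps))).sum(-1)
    kl_qm = (q * (torch.log(q + eps) - torch.log(m + eps))).sum(-1)
    return (0.5 * kl_pm + 0.5 * kl_qm).mean()                      # JS = average of the two KLs

loss = js_consistency(torch.randn(2, 8, 32, 32), torch.randn(2, 8, 32, 32))
```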

56.AttendSeg: A Tiny Attention Condenser Neural Network for Semantic Segmentation on the Edge ⬇️

In this study, we introduce **AttendSeg**, a low-precision, highly compact deep neural network tailored for on-device semantic segmentation. AttendSeg possesses a self-attention network architecture comprising light-weight attention condensers for improved spatial-channel selective attention at very low complexity. The unique macro-architecture and micro-architecture design properties of AttendSeg strike a strong balance between representational power and efficiency, achieved via a machine-driven design exploration strategy tailored specifically for the task at hand. Experimental results demonstrate that the proposed AttendSeg can achieve segmentation accuracy comparable to much larger deep neural networks with greater complexity while possessing significantly lower architectural and computational complexity (requiring as many as >27x fewer MACs, >72x fewer parameters, and >288x lower weight memory requirements), making it well-suited for TinyML applications on the edge.

57.Crack Semantic Segmentation using the U-Net with Full Attention Strategy ⬇️

Structures suffer from the emergence of cracks; therefore, crack detection is always a matter of much concern in structural health monitoring. Along with the rapid progress of deep learning technology, image semantic segmentation, an active research field, offers a more effective and intelligent solution to crack detection. Although numerous artificial neural networks have been developed to address this issue, explorations to improve the quality of crack detection have never stopped. This paper presents a novel artificial neural network architecture named Full Attention U-net for image semantic segmentation. The proposed architecture uses the U-net as the backbone and adopts the Full Attention Strategy, which is a synthesis of the attention mechanism and the outputs from each encoding layer in the skip connections. Subject to the available training hardware, the experiments consist of verification and validation. In verification, four networks, including U-net, Attention U-net, Advanced Attention U-net, and Full Attention U-net, are tested on cell images for a comparative study. With respect to mean intersection-over-union and clarity of edge identification, the Full Attention U-net performs best in verification and is hence applied for crack semantic segmentation in validation to demonstrate its effectiveness.
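
As a point of reference for the attention used on skip connections, the sketch below implements a standard additive attention gate in the style of Attention U-net; the Full Attention Strategy additionally combines the outputs from every encoding layer, which is not reproduced here, and all sizes are assumptions.

```python
import torch
import torch.nn as nn

class AttentionGate(nn.Module):
    def __init__(self, skip_ch, gate_ch, inter_ch):
        super().__init__()
        self.theta = nn.Conv2d(skip_ch, inter_ch, kernel_size=1)
        self.phi = nn.Conv2d(gate_ch, inter_ch, kernel_size=1)
        self.psi = nn.Conv2d(inter_ch, 1, kernel_size=1)

    def forward(self, skip, gate):
        # The gate comes from the coarser decoder level; upsample it to the skip's resolution.
        g = nn.functional.interpolate(self.phi(gate), size=skip.shape[2:],
                                      mode="bilinear", align_corners=False)
        attn = torch.sigmoid(self.psi(torch.relu(self.theta(skip) + g)))   # (B, 1, H, W) in [0, 1]
        return skip * attn              # suppress irrelevant regions before concatenation

gated = AttentionGate(64, 128, 32)(torch.rand(1, 64, 64, 64), torch.rand(1, 128, 32, 32))
```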

58.Unsupervised Layered Image Decomposition into Object Prototypes ⬇️

We present an unsupervised learning framework for decomposing images into layers of automatically discovered object models. Contrary to recent approaches that model image layers with autoencoder networks, we represent them as explicit transformations of a small set of prototypical images. Our model has three main components: (i) a set of object prototypes in the form of learnable images with a transparency channel, which we refer to as sprites; (ii) differentiable parametric functions predicting occlusions and transformation parameters necessary to instantiate the sprites in a given image; (iii) a layered image formation model with occlusion for compositing these instances into complete images including background. By jointly learning the sprites and occlusion/transformation predictors to reconstruct images, our approach not only yields accurate layered image decompositions, but also identifies object categories and instance parameters. We first validate our approach by providing results on par with the state of the art on standard multi-object synthetic benchmarks (Tetrominoes, Multi-dSprites, CLEVR6). We then demonstrate the applicability of our model to real images in tasks that include clustering (SVHN, GTSRB), cosegmentation (Weizmann Horse) and object discovery from unfiltered social network images. To the best of our knowledge, our approach is the first layered image decomposition algorithm that learns an explicit and shared concept of object type, and is robust enough to be applied to real images.
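
The image formation step can be illustrated with a short sketch: learnable RGBA sprites are warped by per-instance transformations and alpha-composited back-to-front over a background. The occlusion and transformation predictors of the actual model are omitted, and the shapes and names used are assumptions.

```python
import torch
import torch.nn.functional as F

def composite(background, sprites, thetas):
    """background: (B,3,H,W); sprites: list of (4,h,w) RGBA tensors; thetas: (B,K,2,3) affine params."""
    B, _, H, W = background.shape
    canvas = background
    for k, sprite in enumerate(sprites):                              # back-to-front layer order
        grid = F.affine_grid(thetas[:, k], (B, 4, H, W), align_corners=False)
        warped = F.grid_sample(sprite.unsqueeze(0).expand(B, -1, -1, -1), grid, align_corners=False)
        rgb, alpha = warped[:, :3], warped[:, 3:4].clamp(0, 1)
        canvas = alpha * rgb + (1 - alpha) * canvas                   # standard "over" compositing
    return canvas

bg = torch.zeros(1, 3, 64, 64)
sprites = [torch.rand(4, 16, 16, requires_grad=True) for _ in range(3)]   # learnable RGBA sprites
thetas = torch.eye(2, 3).reshape(1, 1, 2, 3).repeat(1, 3, 1, 1)           # identity placements
img = composite(bg, sprites, thetas)
```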

59.Faster Meta Update Strategy for Noise-Robust Deep Learning ⬇️

It has been shown that deep neural networks are prone to overfitting on biased training data. To address this issue, meta-learning employs a meta model for correcting the training bias. Despite promising performance, extremely slow training is currently the bottleneck of meta-learning approaches. In this paper, we introduce a novel Faster Meta Update Strategy (FaMUS) to replace the most expensive step in the meta gradient computation with a faster layer-wise approximation. We empirically find that FaMUS yields not only a reasonably accurate but also a low-variance approximation of the meta gradient. We conduct extensive experiments to verify the proposed method on two tasks. We show that our method is able to save two-thirds of the training time while maintaining comparable or even better generalization performance. In particular, our method achieves state-of-the-art performance on both synthetic and realistic noisy labels, and obtains promising performance on long-tailed recognition on standard benchmarks.

60.Time Series Forecasting of New Cases and New Deaths Rate for COVID-19 using Deep Learning Methods ⬇️

COVID-19 emerged in 2019, imposed restrictions in many countries, and has incurred substantial costs for organisations and governments. Predicting the number of new cases and deaths during this period can be a useful step in predicting the costs and facilities required in the future. The purpose of this study is to predict new cases and the death rate seven days ahead. Deep learning methods and statistical analysis are used to model these predictions over 100 days of data adopted from the WHO website. Six different deep learning methods are examined: three base methods, LSTM, Convolutional LSTM, and GRU, and the bi-directional mode of each, used to forecast the rate of new cases and new deaths for Australia and Iran. This study is novel in that it implements these three deep learning methods, together with their bi-directional variants, to predict COVID-19 new-case and new-death time series. All methods are compared, and the results are examined in the form of graphs and statistical analyses. The results show that the bi-directional models have lower error than the other models. Several error evaluation metrics are used to compare all models, and the superiority of the bi-directional methods is confirmed by statistical tests on the datasets. This research could be useful for organisations working against COVID-19 and in determining their long-term plans.
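
A minimal sketch of a bidirectional LSTM forecaster on a sliding window of daily counts is given below; the window length, layer sizes, and training settings are illustrative and not taken from the paper.

```python
import numpy as np
import tensorflow as tf

def make_windows(series, lookback=14, horizon=7):
    # Build (input window, 7-day-ahead target) pairs from a daily time series.
    X, y = [], []
    for i in range(len(series) - lookback - horizon + 1):
        X.append(series[i:i + lookback])
        y.append(series[i + lookback:i + lookback + horizon])
    return np.array(X)[..., None], np.array(y)

series = np.random.rand(100).astype("float32")        # stand-in for 100 days of new cases
X, y = make_windows(series)

model = tf.keras.Sequential([
    tf.keras.Input(shape=(14, 1)),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32)),
    tf.keras.layers.Dense(7),                          # 7-day-ahead forecast
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=2, verbose=0)
next_week = model.predict(series[-14:].reshape(1, 14, 1), verbose=0)
```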

61.Certifying Emergency Landing for Safe Urban UAV ⬇️

Unmanned Aerial Vehicles (UAVs) have the potential to be used for many applications in urban environments. However, allowing UAVs to fly above densely populated areas raises concerns regarding safety. One of the main safety issues is the possibility of a failure causing the loss of navigation capabilities, which can result in the UAV falling or landing in hazardous areas such as busy roads, where it can cause fatal accidents. Current standards, such as the SORA published in 2019, do not consider applicable mitigation techniques to handle this kind of hazardous situation. Consequently, certifying UAV urban operations implies demonstrating very high levels of integrity, which results in prohibitive development costs. To address this issue, this paper explores the concept of Emergency Landing (EL). A safety analysis is conducted on an urban UAV case study, and requirements are proposed to enable the integration of EL as an acceptable mitigation means in the SORA. Based on these requirements, an EL implementation was developed, together with a runtime monitoring architecture to enhance confidence in the system. Preliminary qualitative results are presented, and the monitor seems able to detect errors of the EL system effectively.

62.Robust joint registration of multiple stains and MRI for multimodal 3D histology reconstruction: Application to the Allen human brain atlas ⬇️

Joint registration of a stack of 2D histological sections to recover 3D structure (3D histology reconstruction) finds application in areas such as atlas building and validation of in vivo imaging. Straightforward pairwise registration of neighbouring sections yields smooth reconstructions but has well-known problems such as the banana effect (straightening of curved structures) and z-shift (drift). While these problems can be alleviated with an external, linearly aligned reference (e.g., Magnetic Resonance images), registration is often inaccurate due to contrast differences and the strong nonlinear distortion of the tissue, including artefacts such as folds and tears. In this paper, we present a probabilistic model of spatial deformation that yields reconstructions for multiple histological stains that are jointly smooth, robust to outliers, and follow the reference shape. The model relies on a spanning tree of latent transforms connecting all the sections and slices, and assumes that the registration between any pair of images can be seen as a noisy version of the composition of (possibly inverted) latent transforms connecting the two images. Bayesian inference is used to compute the most likely latent transforms given a set of pairwise registrations between image pairs within and across modalities. Results on synthetic deformations on multiple MR modalities show that our method can accurately and robustly register multiple contrasts even in the presence of outliers. The 3D histology reconstruction of two stains (Nissl and parvalbumin) from the Allen human brain atlas shows its benefits on real data with severe distortions. We also provide the correspondence to MNI space, bridging the gap between two of the most used atlases in histology and MRI. Data is available at this https URL and code at this https URL.

63.ICOS: Efficient and Highly Robust Point Cloud Registration with Correspondences ⬇️

Point Cloud Registration is a fundamental problem in robotics and computer vision. Due to the limited accuracy in the matching process of 3D keypoints, the presence of outliers, probably in very large numbers, is common in many real-world applications. In this paper, we present ICOS (Inlier searching using COmpatible Structures), a novel, efficient and highly robust solution for the correspondence-based point cloud registration problem. Specifically, we (i) propose and construct a series of compatible structures for the registration problem where various invariants can be established, and (ii) design two time-efficient frameworks, one for known-scale registration and the other for unknown-scale registration, to filter out outliers and seek inliers from the invariant-constrained random sampling built upon the compatible structures. In this manner, even with extreme outlier ratios, inliers can be detected and collected for solving the optimal transformation, leading to our robust registration solver ICOS. Through plentiful experiments over standard real datasets, we demonstrate that: (i) our solver ICOS is fast, accurate, robust against as many as 99% outliers with nearly 100% recall ratio of inliers whether the scale is known or unknown, outperforming other state-of-the-art methods, (ii) ICOS is practical for use in real-world applications.

64.A Refined Inertial DCA for DC Programming ⬇️

We consider difference-of-convex (DC) programming problems whose objective function is level-bounded. The classical DC algorithm (DCA) is well known for solving this kind of problem and returns a critical point. Recently, de Oliveira and Tcheo incorporated an inertial-force procedure into DCA (InDCA) for potential acceleration and for preventing the algorithm from converging to a critical point that is not d(irectional)-stationary. In this paper, based on InDCA, we propose two refined inertial DCA (RInDCA) variants with enlarged inertial step-sizes for better acceleration. We demonstrate the subsequential convergence of our refined versions to a critical point. In addition, by assuming the Kurdyka-Lojasiewicz (KL) property of the objective function, we establish the sequential convergence of RInDCA. Numerical simulations on an image restoration problem show the benefit of the enlarged step-size.
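
To make the inertial step concrete, the toy sketch below runs DCA with an inertial term on the level-bounded DC function f(x) = x^4/4 - x^2/2, with g(x) = x^4/4 and h(x) = x^2/2. The inertial update used here is one common form; the enlarged step-size conditions of RInDCA are not reproduced.

```python
import numpy as np

def inertial_dca(x0, gamma=0.3, iters=50):
    # DCA step: y_k = h'(x_k) + gamma * (x_k - x_{k-1}), then x_{k+1} = argmin_x g(x) - y_k * x.
    # For g(x) = x^4/4 the convex subproblem solves x^3 = y_k, i.e. x_{k+1} = cbrt(y_k).
    x_prev, x = x0, x0
    for _ in range(iters):
        y = x + gamma * (x - x_prev)        # subgradient of h (h'(x) = x) plus inertial force
        x_prev, x = x, np.cbrt(y)
    return x

for x0 in (0.2, -0.2, 2.0):
    print(x0, "->", inertial_dca(x0))       # converges to a critical point (here +/-1)
```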

65.Physically Feasible Vehicle Trajectory Prediction ⬇️

Predicting the future motion of actors in a traffic scene is a crucial part of any autonomous driving system. Recent research in this area has focused on trajectory prediction approaches that optimize standard trajectory error metrics. In this work, we describe three important properties -- physical realism guarantees, system maintainability, and sample efficiency -- which we believe are equally important for developing a self-driving system that can operate safely and practically in the real world. Furthermore, we introduce PTNet (PathTrackingNet), a novel approach for vehicle trajectory prediction that is a hybrid of the classical pure pursuit path tracking algorithm and modern graph-based neural networks. By combining a structured robotics technique with a flexible learning approach, we are able to produce a system that not only achieves the same level of performance as other state-of-the-art methods on traditional trajectory error metrics, but also provides strong guarantees about the physical realism of the predicted trajectories while requiring half the amount of data. We believe focusing on this new class of hybrid approaches is a useful direction for developing and maintaining a safety-critical autonomous driving system.
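
For context on the classical half of the hybrid, the sketch below shows one pure pursuit tracking step: pick a lookahead point on the reference path and compute the curvature of the arc that reaches it. The learned, graph-based part of PTNet is not shown, and the parameter values are assumptions.

```python
import numpy as np

def pure_pursuit_step(pose, path, lookahead=5.0):
    """pose: (x, y, heading); path: (N, 2) array of waypoints. Returns steering curvature."""
    x, y, yaw = pose
    d = np.linalg.norm(path - np.array([x, y]), axis=1)
    candidates = np.where(d >= lookahead)[0]
    target = path[candidates[0]] if len(candidates) else path[-1]   # first point past the lookahead
    # Transform the target into the vehicle frame.
    dx, dy = target[0] - x, target[1] - y
    lx = np.cos(-yaw) * dx - np.sin(-yaw) * dy
    ly = np.sin(-yaw) * dx + np.cos(-yaw) * dy
    L = np.hypot(lx, ly)
    return 2.0 * ly / (L ** 2)              # curvature of the circular arc through the target

path = np.stack([np.linspace(0, 20, 50), 0.1 * np.linspace(0, 20, 50) ** 1.5], axis=1)
print(pure_pursuit_step((0.0, 0.0, 0.0), path))
```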

66.End-to-End Jet Classification of Boosted Top Quarks with the CMS Open Data ⬇️

We describe a novel application of the end-to-end deep learning technique to the task of discriminating top quark-initiated jets from those originating from the hadronization of a light quark or a gluon. The end-to-end deep learning technique combines deep learning algorithms and low-level detector representation of the high-energy collision event. In this study, we use low-level detector information from the simulated CMS Open Data samples to construct the top jet classifiers. To optimize classifier performance we progressively add low-level information from the CMS tracking detector, including pixel detector reconstructed hits and impact parameters, and demonstrate the value of additional tracking information even when no new spatial structures are added. Relying only on calorimeter energy deposits and reconstructed pixel detector hits, the end-to-end classifier achieves an AUC score of 0.975$\pm$0.002 for the task of classifying boosted top quark jets. After adding derived track quantities, the classifier AUC score increases to 0.9824$\pm$0.0013, serving as the first performance benchmark for these CMS Open Data samples. We additionally provide a timing performance comparison of different processor unit architectures for training the network.

67.Lung Cancer Diagnosis Using Deep Attention Based on Multiple Instance Learning and Radiomics ⬇️

Early diagnosis of lung cancer is a key intervention for the treatment of lung cancer, in which computer-aided diagnosis (CAD) can play a crucial role. However, most published CAD methods treat lung cancer diagnosis as a lung nodule classification problem, which does not reflect clinical practice, where clinicians diagnose a patient based on a set of images of nodules instead of one specific nodule. Besides, the low interpretability of the output provided by these methods presents an important barrier to their adoption. In this article, we treat lung cancer diagnosis as a multiple instance learning (MIL) problem in order to better reflect the diagnosis process in the clinical setting and to obtain higher interpretability of the output. We chose radiomics as the source of input features and deep attention-based MIL as the classification algorithm. The attention mechanism provides higher interpretability by estimating the importance of each instance in the set for the final prediction. In order to improve the model's performance on a small imbalanced dataset, we introduce a new bag simulation method for MIL. The results show that our method can achieve a mean accuracy of 0.807 with a standard error of the mean (SEM) of 0.069, a recall of 0.870 (SEM 0.061), a positive predictive value of 0.928 (SEM 0.078), a negative predictive value of 0.591 (SEM 0.155) and an area under the curve (AUC) of 0.842 (SEM 0.074), outperforming other MIL methods. Additional experiments show that the proposed oversampling strategy significantly improves the model's performance. In addition, our experiments show that our method provides an indication of the importance of each nodule in determining the diagnosis, which, combined with the well-defined radiomic features, makes the results more interpretable and acceptable for doctors and patients.
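
The attention-based MIL pooling can be sketched compactly: per-nodule feature vectors are scored by a small attention network, the bag representation is their attention-weighted average, and a classifier produces the diagnosis probability. The sketch follows the standard attention-MIL formulation; the feature dimension, network sizes, and the choice of pooling directly over raw radiomic features are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AttentionMIL(nn.Module):
    def __init__(self, in_dim=107, hidden=64):
        super().__init__()
        self.V = nn.Linear(in_dim, hidden)
        self.w = nn.Linear(hidden, 1)
        self.classifier = nn.Linear(in_dim, 1)

    def forward(self, bag):
        """bag: (num_instances, in_dim) radiomic features of the nodules from one patient."""
        scores = self.w(torch.tanh(self.V(bag)))       # (N, 1) unnormalized attention
        attn = torch.softmax(scores, dim=0)            # instance importance, sums to 1
        bag_repr = (attn * bag).sum(dim=0)              # attention-weighted average of instances
        prob = torch.sigmoid(self.classifier(bag_repr))
        return prob, attn.squeeze(-1)                   # diagnosis probability + per-nodule weights

model = AttentionMIL()
prob, weights = model(torch.randn(5, 107))              # a patient with 5 nodules
```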

68.Cluster-driven Graph Federated Learning over Multiple Domains ⬇️

Federated Learning (FL) deals with learning a central model (i.e. the server) in privacy-constrained scenarios, where data are stored on multiple devices (i.e. the clients). The central model has no direct access to the data, but only to the updates of the parameters computed locally by each client. This raises a problem, known as statistical heterogeneity, because the clients may have different data distributions (i.e. domains). This is only partly alleviated by clustering the clients. Clustering may reduce heterogeneity by identifying the domains, but it deprives each cluster model of the data and supervision of others. Here we propose a novel Cluster-driven Graph Federated Learning (FedCG). In FedCG, clustering serves to address statistical heterogeneity, while Graph Convolutional Networks (GCNs) enable sharing knowledge across them. FedCG: i) identifies the domains via an FL-compliant clustering and instantiates domain-specific modules (residual branches) for each domain; ii) connects the domain-specific modules through a GCN at training to learn the interactions among domains and share knowledge; and iii) learns to cluster unsupervised via teacher-student classifier-training iterations and to address novel unseen test domains via their domain soft-assignment scores. Thanks to the unique interplay of GCN over clusters, FedCG achieves the state-of-the-art on multiple FL benchmarks.