
ArXiv cs.CV --Mon, 15 Jun 2020

1.Robust Baggage Detection and Classification Based on Local Tri-directional Pattern ⬇️

In recent decades, automatic video surveillance has gained significant importance in the computer vision community; its crucial objective is monitoring and security in public places. In the traditional Local Binary Pattern (LBP), the feature description is somewhat inaccurate and the feature size is large. To overcome these shortcomings, we propose a detection algorithm for humans with or without carried baggage. The local tri-directional pattern (LtriDP) descriptor is used to extract features of different human body parts, including the head, trunk, and limbs. The extracted features are then used to train and evaluate a support vector machine. Experimental results on the INRIA and MSMT17 V1 datasets show that LtriDP outperforms several state-of-the-art feature descriptors, validating its effectiveness.
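
For context, the classical LBP descriptor that LtriDP refines thresholds each pixel's 3x3 neighborhood against the center pixel and packs the comparisons into an 8-bit code whose histogram serves as the feature. A minimal NumPy sketch of this baseline (illustrative only; the tri-directional pattern itself is not shown):

```python
import numpy as np

def local_binary_pattern(img):
    """Basic 3x3 LBP: compare each pixel's 8 neighbors to the center
    and pack the results into an 8-bit code."""
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]  # clockwise from top-left
    h, w = img.shape
    center = img[1:-1, 1:-1]
    codes = np.zeros((h - 2, w - 2), dtype=np.uint8)
    for bit, (dy, dx) in enumerate(offsets):
        neighbor = img[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        codes |= (neighbor >= center).astype(np.uint8) << bit
    return codes

img = np.random.randint(0, 256, (64, 64), dtype=np.uint8)
hist, _ = np.histogram(local_binary_pattern(img), bins=256, range=(0, 256))
```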

2.GNN3DMOT: Graph Neural Network for 3D Multi-Object Tracking with Multi-Feature Learning ⬇️

3D multi-object tracking (MOT) is crucial to autonomous systems. Recent work uses a standard tracking-by-detection pipeline, where feature extraction is first performed independently for each object in order to compute an affinity matrix. Then the affinity matrix is passed to the Hungarian algorithm for data association. A key process of this standard pipeline is to learn discriminative features for different objects in order to reduce confusion during data association. In this work, we propose two techniques to improve the discriminative feature learning for MOT: (1) instead of obtaining features for each object independently, we propose a novel feature interaction mechanism by introducing a Graph Neural Network. As a result, the feature of one object is informed of the features of other objects so that it can lean towards objects with similar features (i.e., objects probably with the same ID) and deviate from objects with dissimilar features (i.e., objects probably with different IDs), leading to a more discriminative feature for each object; (2) instead of obtaining the feature from either 2D or 3D space as in prior work, we propose a novel joint feature extractor to learn appearance and motion features from 2D and 3D space simultaneously. As features from different modalities often have complementary information, the joint feature can be more discriminative than features from each individual modality. To ensure that the joint feature extractor does not heavily rely on one modality, we also propose an ensemble training paradigm. Through extensive evaluation, our proposed method achieves state-of-the-art performance on the KITTI and nuScenes 3D MOT benchmarks. Our code will be made available at this https URL
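
A minimal sketch of the feature-interaction idea: each object's feature is refined by affinity-weighted messages from all other objects, so that features of likely-same-ID objects can attract one another. Layer shapes and names here are our illustration, not the paper's architecture:

```python
import torch
import torch.nn as nn

class FeatureInteraction(nn.Module):
    """One message-passing layer over N object features (hypothetical sizes)."""
    def __init__(self, dim):
        super().__init__()
        self.edge = nn.Linear(2 * dim, 1)      # scores pairwise affinity
        self.update = nn.Linear(2 * dim, dim)  # fuses own feature with messages

    def forward(self, x):                       # x: (N, dim)
        n, d = x.shape
        xi = x.unsqueeze(1).expand(n, n, d)
        xj = x.unsqueeze(0).expand(n, n, d)
        a = torch.softmax(self.edge(torch.cat([xi, xj], -1)).squeeze(-1), dim=1)
        msg = a @ x                             # affinity-weighted aggregation
        return torch.relu(self.update(torch.cat([x, msg], -1)))

refined = FeatureInteraction(64)(torch.randn(5, 64))  # 5 detected objects
```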

3.Multiple-Vehicle Tracking in the Highway Using Appearance Model and Visual Object Tracking ⬇️

In recent decades, thanks to groundbreaking improvements in machine vision, many daily tasks are performed by computers. One of these tasks is multiple-vehicle tracking, which is widely used in areas such as video surveillance and traffic monitoring. This paper introduces an efficient novel approach with acceptable accuracy, achieved through an efficient appearance and motion model based on the features extracted from each object. Two different approaches are used to extract features: features extracted from a deep neural network, and traditional features. The results from these two approaches are then compared with state-of-the-art trackers on the UA-DETRAC benchmark. The first method achieved 58.9% accuracy, while the second achieved up to 15.9%. The proposed methods can still be improved by extracting more distinguishable features.

4.Pitfalls of the Gram Loss for Neural Texture Synthesis in Light of Deep Feature Histograms ⬇️

Neural texture synthesis and style transfer are both powered by the Gram matrix as a means to measure deep feature statistics. Despite its ubiquity, this second-order feature descriptor has several shortcomings resulting in visual artifacts, ill-defined interpolation, or an inability to capture spatial constraints. Many previous works acknowledge these shortcomings but do not really explain why they occur. Fixing them is thus usually approached by adding new losses, which require parameter tuning and make the problem even more ill-defined, or by designing complex and/or adversarial architectures. In this paper, we propose a comprehensive study of these problems in the light of the multi-dimensional histograms of deep features. With the insights gained from our analysis, we show how to compute a well-defined and efficient textural loss based on histogram transformations. Our textural loss outperforms the Gram matrix in terms of quality, robustness, spatial control, and interpolation. It does not require additional learning or parameter tuning, and can be implemented in a few lines of code.
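
For reference, the Gram matrix under discussion is the channel-wise inner product of a deep feature map; a small PyTorch sketch of the standard Gram-based texture loss that the paper's histogram loss replaces:

```python
import torch

def gram_matrix(feat):
    """feat: (C, H, W) deep feature map -> (C, C) second-order statistics."""
    c, h, w = feat.shape
    f = feat.reshape(c, h * w)
    return f @ f.t() / (h * w)

def gram_loss(feat_synth, feat_target):
    """Standard texture loss: match Gram matrices of synthesized and target."""
    return ((gram_matrix(feat_synth) - gram_matrix(feat_target)) ** 2).mean()

loss = gram_loss(torch.randn(64, 32, 32), torch.randn(64, 32, 32))
```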

5.Local-Area-Learning Network: Meaningful Local Areas for Efficient Point Cloud Analysis ⬇️

Research in point cloud analysis with deep neural networks has made rapid progress in recent years. The pioneering work PointNet offered a direct analysis of point clouds. However, due to its architecture, PointNet is not able to capture local structures. To overcome this drawback, the same authors developed PointNet++ by applying PointNet to local areas. The local areas are defined by center points and their neighbors. In PointNet++ and its further developments, the center points are determined with a Farthest Point Sampling (FPS) algorithm. This has the disadvantage that the center points in general do not have meaningful local areas. In this paper, we introduce the neural Local-Area-Learning Network (LocAL-Net), which places emphasis on the selection and characterization of the local areas. Our approach learns critical points that we use as center points. In order to strengthen the recognition of local structures, the points are given additional metric properties depending on the local areas. Finally, we derive and combine two global feature vectors, one from the whole point cloud and one from all local areas. Experiments on the ModelNet10/40 and ShapeNet datasets show that LocAL-Net is competitive for part segmentation. For classification, LocAL-Net outperforms the state of the art.

6.Branch-Cooperative OSNet for Person Re-Identification ⬇️

Multi-branch architectures have been extensively studied for learning rich feature representations for person re-identification (Re-ID). In this paper, we propose a branch-cooperative architecture over OSNet, termed BC-OSNet, for person Re-ID. By stacking four cooperative branches, namely a global branch, a local branch, a relational branch, and a contrastive branch, we obtain a powerful feature representation for person Re-ID. Extensive experiments show that the proposed BC-OSNet achieves state-of-the-art performance on three popular datasets, including Market-1501, DukeMTMC-reID, and CUHK03. In particular, it achieves an mAP of 84.0% and rank-1 accuracy of 87.1% on CUHK03_labeled.

7.Video Understanding as Machine Translation ⬇️

With the advent of large-scale multimodal video datasets, especially sequences with audio or transcribed speech, there has been a growing interest in self-supervised learning of video representations. Most prior work formulates the objective as a contrastive metric learning problem between the modalities. To enable effective learning, however, these strategies require a careful selection of positive and negative samples often combined with hand-designed curriculum policies. In this work we remove the need for negative sampling by taking a generative modeling approach that poses the objective as a translation problem between modalities. Such a formulation allows us to tackle a wide variety of downstream video understanding tasks by means of a single unified framework, without the need for large batches of negative samples common in contrastive metric learning. We experiment with the large-scale HowTo100M dataset for training, and report performance gains over the state-of-the-art on several downstream tasks including video classification (EPIC-Kitchens), question answering (TVQA), captioning (TVC, YouCook2, and MSR-VTT), and text-based clip retrieval (YouCook2 and MSR-VTT).

8.ESAD: Endoscopic Surgeon Action Detection Dataset ⬇️

In this work, we aim to increase the effectiveness of surgical assistant robots. We intend to make assistant robots safer by making them aware of the surgeon's actions, so that they can take appropriate assisting actions. In other words, we aim to solve the problem of surgeon action detection in endoscopic videos. To this end, we introduce a challenging dataset for surgeon action detection in real-world endoscopic videos. Action classes are picked based on feedback from surgeons and annotated by medical professionals. Given a video frame, we draw a bounding box around the surgical tool performing the action and label it with an action label. Finally, we present a frame-level action detection baseline model based on recent advances in object detection. Results on our new dataset show that it provides interesting challenges for future methods and can serve as a strong benchmark for research on surgeon action detection in endoscopic videos.

9.Are we done with ImageNet? ⬇️

Yes, and no. We ask whether recent progress on the ImageNet classification benchmark continues to represent meaningful generalization, or whether the community has started to overfit to the idiosyncrasies of its labeling procedure. We therefore develop a significantly more robust procedure for collecting human annotations of the ImageNet validation set. Using these new labels, we reassess the accuracy of recently proposed ImageNet classifiers, and find their gains to be substantially smaller than those reported on the original labels. Furthermore, we find the original ImageNet labels to no longer be the best predictors of this independently-collected set, indicating that their usefulness in evaluating vision models may be nearing an end. Nevertheless, we find our annotation procedure to have largely remedied the errors in the original labels, reinforcing ImageNet as a powerful benchmark for future research in visual recognition.

10.Attribute analysis with synthetic dataset for person re-identification ⬇️

Person re-identification (re-ID) plays an important role in applications such as public security and video surveillance. Recently, learning from synthetic data, which benefits from the availability of synthetic data engines, has achieved remarkable performance. However, existing synthetic datasets are small and lack diversity, which hinders the development of person re-ID in real-world scenarios. To address this problem, we first develop a large-scale synthetic data engine whose salient characteristic is controllability. Based on it, we build a large-scale synthetic dataset that is diversified and customizable across different attributes, such as illumination and viewpoint. Second, we quantitatively analyze the influence of dataset attributes on the re-ID system. To the best of our knowledge, this is the first attempt to explicitly dissect person re-ID from the perspective of attributes using a synthetic dataset. Comprehensive experiments provide a deeper understanding of the fundamental problems in person re-ID. Our research also provides useful insights for dataset building and future practical usage.

11.Knowledge Distillation Meets Self-Supervision ⬇️

Knowledge distillation, which involves extracting the "dark knowledge" from a teacher network to guide the learning of a student network, has emerged as an important technique for model compression and transfer learning. Unlike previous works that exploit architecture-specific cues such as activation and attention for distillation, here we wish to explore a more general and model-agnostic approach for extracting "richer dark knowledge" from the pre-trained teacher model. We show that the seemingly different self-supervision task can serve as a simple yet powerful solution. For example, when performing contrastive learning between transformed entities, the noisy predictions of the teacher network reflect its intrinsic composition of semantic and pose information. By exploiting the similarity between those self-supervision signals as an auxiliary task, one can effectively transfer the hidden information from the teacher to the student. In this paper, we discuss practical ways to exploit those noisy self-supervision signals with selective transfer for distillation. We further show that self-supervision signals improve conventional distillation with substantial gains under few-shot and noisy-label scenarios. Given the richer knowledge mined from self-supervision, our knowledge distillation approach achieves state-of-the-art performance on standard benchmarks, i.e., CIFAR100 and ImageNet, under both similar-architecture and cross-architecture settings. The advantage is even more pronounced under the cross-architecture setting, where our method outperforms the state-of-the-art CRD by an average of 2.3% in accuracy on CIFAR100 across six different teacher-student pairs.
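
A hedged sketch of the two ingredients: standard soft-label distillation, plus an auxiliary term that matches the pairwise similarity structure of embeddings computed on transformed inputs between teacher and student. This is our simplification; the paper's selective-transfer scheme is more elaborate:

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, T=4.0):
    """Classic soft-label distillation (temperature-scaled KL)."""
    p_t = F.softmax(teacher_logits / T, dim=1)
    log_p_s = F.log_softmax(student_logits / T, dim=1)
    return F.kl_div(log_p_s, p_t, reduction="batchmean") * T * T

def self_supervision_transfer(student_emb, teacher_emb):
    """Match the teacher's pairwise similarity matrix on augmented views."""
    s = F.normalize(student_emb, dim=1)
    t = F.normalize(teacher_emb, dim=1)
    return F.mse_loss(s @ s.t(), t @ t.t())

total = kd_loss(torch.randn(8, 100), torch.randn(8, 100)) \
      + self_supervision_transfer(torch.randn(8, 128), torch.randn(8, 128))
```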

12.A Face Preprocessing Approach for Improved DeepFake Detection ⬇️

Recent advancements in content generation technologies (also widely known as DeepFakes) along with the online proliferation of manipulated media and disinformation campaigns render the detection of such manipulations a task of increasing importance. There are numerous works related to DeepFake detection but there is little focus on the impact of dataset preprocessing on the detection accuracy of the models. In this paper, we focus on this aspect of the DeepFake detection task and propose a preprocessing step to improve the quality of training datasets for the problem. We also examine its effects on the DeepFake detection performance. Experimental results demonstrate that the proposed preprocessing approach leads to measurable improvements in the performance of detection models.

13.Unmasking the Inductive Biases of Unsupervised Object Representations for Video Sequences ⬇️

Perceiving the world in terms of objects is a crucial prerequisite for reasoning and scene understanding. Recently, several methods have been proposed for unsupervised learning of object-centric representations. However, since these models have been evaluated with respect to different downstream tasks, it remains unclear how they compare in terms of basic perceptual abilities such as detection, figure-ground segmentation and tracking of individual objects. In this paper, we argue that the established evaluation protocol of multi-object tracking tests precisely these perceptual qualities and we propose a new benchmark dataset based on procedurally generated video sequences. Using this benchmark, we compare the perceptual abilities of three state-of-the-art unsupervised object-centric learning approaches. Towards this goal, we propose a video-extension of MONet, a seminal object-centric model for static scenes, and compare it to two recent video models: OP3, which exploits clustering via spatial mixture models, and TBA, which uses an explicit factorization via spatial transformers. Our results indicate that architectures which employ unconstrained latent representations based on per-object variational autoencoders and full-image object masks are able to learn more powerful representations in terms of object detection, segmentation and tracking than the explicitly parameterized spatial transformer based architecture. We also observe that none of the methods are able to gracefully handle the most challenging tracking scenarios, suggesting that our synthetic video benchmark may provide fruitful guidance towards learning more robust object-centric video representations.

14.Rethinking Sampling in 3D Point Cloud Generative Adversarial Networks ⬇️

In this paper, we examine the long-neglected yet important effects of point sampling patterns in point cloud GANs. Through extensive experiments, we show that sampling-insensitive discriminators (e.g., PointNet-Max) produce shape point clouds with point clustering artifacts, while sampling-oversensitive discriminators (e.g., PointNet++, DGCNN) fail to guide valid shape generation. We propose the concept of the sampling spectrum to depict the different sampling sensitivities of discriminators. We further study how different evaluation metrics weigh the sampling pattern against the geometry, and propose several perceptual metrics forming a sampling spectrum of metrics. Guided by the proposed sampling spectrum, we discover a middle-point, sampling-aware baseline discriminator, PointNet-Mix, which improves all existing point cloud generators by a large margin on sampling-related metrics. We point out that, though recent research has focused on the generator design, the main bottleneck of point cloud GANs actually lies in the discriminator design. Our work provides both suggestions and tools for building future discriminators. We will release the code to facilitate future research.

15.Background Modeling via Uncertainty Estimation for Weakly-supervised Action Localization ⬇️

Weakly-supervised temporal action localization aims to detect intervals of action instances with only video-level action labels for training. A crucial challenge is to separate frames of action classes from the remaining frames, denoted as background frames (i.e., frames not belonging to any action class). Previous methods attempt background modeling by either synthesizing pseudo background videos with static frames or introducing an auxiliary class for background. However, they overlook the essential fact that background frames can be dynamic and inconsistent. Accordingly, we cast the problem of identifying background frames as out-of-distribution detection and isolate it from conventional action classification. Beyond our base action localization network, we propose a module to estimate the probability of being background (i.e., uncertainty [20]), which allows us to learn uncertainty given only video-level labels via multiple instance learning. A background entropy loss is further designed to reject background frames by forcing them to have a uniform probability distribution over action classes. Extensive experiments verify the effectiveness of our background modeling and show that our method significantly outperforms state-of-the-art methods on the standard benchmarks - THUMOS'14 and ActivityNet (1.2 and 1.3). Our code and the trained model are available at this https URL.
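
The background entropy loss has a compact form: forcing background frames toward a uniform distribution over action classes is equivalent to minimizing the cross-entropy against a uniform target. A minimal sketch (weighting and frame-selection details are omitted):

```python
import torch
import torch.nn.functional as F

def background_entropy_loss(logits_bg):
    """logits_bg: (N, C) logits of frames estimated to be background.
    Cross-entropy against a uniform target equals -(1/C) * sum_c log p_c."""
    log_p = F.log_softmax(logits_bg, dim=1)
    return -log_p.mean(dim=1).mean()

loss = background_entropy_loss(torch.randn(16, 20))  # 16 frames, 20 classes
```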

16.Quantum Robust Fitting ⬇️

Many computer vision applications need to recover structure from imperfect measurements of the real world. The task is often solved by robustly fitting a geometric model onto noisy and outlier-contaminated data. However, recent theoretical analyses indicate that many commonly used formulations of robust fitting in computer vision are not amenable to tractable solution and approximation. In this paper, we explore the usage of quantum computers for robust fitting. To do so, we examine and establish the practical usefulness of a robust fitting formulation inspired by Fourier analysis of Boolean functions. We then investigate a quantum algorithm to solve the formulation and analyse the computational speed-up possible over the classical algorithm. Our work thus proposes one of the first quantum treatments of robust fitting for computer vision.

17.Towards Robust Pattern Recognition: A Review ⬇️

The accuracies for many pattern recognition tasks have increased rapidly year by year, matching or even surpassing human performance. From the perspective of accuracy, pattern recognition seems to be a nearly-solved problem. However, once launched in real applications, high-accuracy pattern recognition systems may become unstable and unreliable due to the lack of robustness in open and changing environments. In this paper, we present a comprehensive review of research towards robust pattern recognition from the perspective of breaking three basic and implicit assumptions that form the foundation of most pattern recognition models: the closed-world assumption, the independent-and-identically-distributed assumption, and the clean-and-big-data assumption. In fact, the human brain is robust at learning concepts continually and incrementally, in complex, open, and changing environments, with different contexts, modalities, and tasks, from only a few examples, under weak or noisy supervision. These are the major differences between human intelligence and machine intelligence, and they are closely related to the above three assumptions. After witnessing the significant progress achieved in accuracy, this review enables us to analyze the shortcomings and limitations of current methods and to identify future research directions for robust pattern recognition.

18.Multi Layer Neural Networks as Replacement for Pooling Operations ⬇️

Pooling operations are a layer found in almost every modern neural network; they can be computed at low cost and serve as a linear or nonlinear transfer function for data reduction. Many modern approaches replace the common maximum or mean value operations with alternatives, or provide a parameterized function from which different pooling behaviors can be selected; additional neural networks are then used to estimate the parameters of these pooling functions. Such pooling layers therefore need many additional parameters and increase the complexity of the whole model. In this work, we show that a single perceptron can already be used effectively as a pooling operation without increasing the complexity of the model. This kind of pooling allows multi-layer neural networks to be integrated directly into a model as a pooling operation by restructuring the data, and thus to learn complex pooling operations. We compare our approach to tensor convolution with strides as a pooling operation and show that our approach is effective and reduces complexity. The restructuring of the data in combination with multiple perceptrons also allows our approach to be used for upscaling, as employed by transposed convolutions in semantic segmentation.
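
A minimal sketch of the idea: restructure each pooling window into a vector and let a single perceptron, shared across channels, produce the pooled value. The window size and exact restructuring are our assumptions:

```python
import torch
import torch.nn as nn

class PerceptronPool(nn.Module):
    """Learned pooling: one perceptron maps each k x k window to one value."""
    def __init__(self, k=2):
        super().__init__()
        self.k = k
        self.fc = nn.Linear(k * k, 1)  # a single perceptron, shared across channels

    def forward(self, x):              # x: (B, C, H, W), H and W divisible by k
        b, c, h, w = x.shape
        k = self.k
        patches = x.unfold(2, k, k).unfold(3, k, k)           # (B, C, H/k, W/k, k, k)
        patches = patches.contiguous().view(b, c, h // k, w // k, k * k)
        return self.fc(patches).squeeze(-1)                   # (B, C, H/k, W/k)

pooled = PerceptronPool()(torch.randn(1, 8, 32, 32))  # -> (1, 8, 16, 16)
```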

19.The eyes know it: FakeET -- An Eye-tracking Database to Understand Deepfake Perception ⬇️

We present **FakeET**, an eye-tracking database to understand human visual perception of *deepfake* videos. Given that the principal purpose of deepfakes is to deceive human observers, FakeET is designed to understand and evaluate the ease with which viewers can detect synthetic video artifacts. FakeET contains viewing patterns compiled from 40 users via the *Tobii* desktop eye-tracker for 811 videos from the *Google Deepfake* dataset, with a minimum of two viewings per video. Additionally, EEG responses acquired via the *Emotiv* sensor are also available. The compiled data confirms (a) distinct eye movement characteristics for *real* vs *fake* videos; (b) utility of the eye-track saliency maps for spatial forgery localization and detection, and (c) Error Related Negativity (ERN) triggers in the EEG responses, and the ability of the *raw* EEG signal to distinguish between *real* and *fake* videos.

20.Does Unsupervised Architecture Representation Learning Help Neural Architecture Search? ⬇️

Existing Neural Architecture Search (NAS) methods either encode neural architectures using discrete encodings that do not scale well, or adopt supervised learning-based methods to jointly learn architecture representations and optimize architecture search on such representations, which incurs search bias. Despite their widespread use, architecture representations learned in NAS are still poorly understood. We observe that the structural properties of neural architectures are hard to preserve in the latent space if architecture representation learning and search are coupled, resulting in less effective search performance. In this work, we find empirically that pre-training architecture representations using only neural architectures, without their accuracies as labels, considerably improves the downstream architecture search efficiency. To explain these observations, we visualize how unsupervised architecture representation learning better encourages neural architectures with similar connections and operators to cluster together. This helps to map neural architectures with similar performance to the same regions in the latent space and makes the transition of architectures in the latent space relatively smooth, which considerably benefits diverse downstream search strategies.

21.Iterate & Cluster: Iterative Semi-Supervised Action Recognition ⬇️

We propose a novel system for active semi-supervised feature-based action recognition. Given time sequences of features tracked during movements, our system clusters the sequences into actions. Our system is based on encoder-decoder unsupervised methods shown to perform clustering by self-organization of their latent representation through the auto-regression task. These methods were tested on human action recognition benchmarks, where they outperformed non-feature-based unsupervised methods and achieved accuracy comparable to skeleton-based supervised methods. However, such methods rely on K-Nearest Neighbours (KNN) to associate sequences with actions, and with general features and no annotated data the resulting clusters are only approximate and can be further refined. Our system proposes an iterative semi-supervised method to address this challenge and to actively learn the association of clusters and actions. The method utilizes the latent space embedding and clustering of the unsupervised encoder-decoder to guide the selection of sequences to be annotated in each iteration. In each iteration, the selection aims to enhance action recognition accuracy while choosing only a small number of sequences for annotation. We test the approach on human skeleton-based action recognition benchmarks, assuming that only the annotations chosen by our method are available, and on videos of mouse movements recorded in lab experiments. We show that our system can boost recognition performance with only a small percentage of annotations. The system can be used as an interactive annotation tool to guide labeling efforts for 'in the wild' videos of various objects and actions to reach robust recognition.

22.SemifreddoNets: Partially Frozen Neural Networks for Efficient Computer Vision Systems ⬇️

We propose a system comprised of fixed-topology neural networks having partially frozen weights, named SemifreddoNets. SemifreddoNets work as fully-pipelined hardware blocks that are optimized to have an efficient hardware implementation. These blocks freeze a certain portion of the parameters at every layer and replace the corresponding multipliers with fixed scalers. Fixing the weights reduces the silicon area, logic delay, and memory requirements, leading to significant savings in cost and power consumption. Unlike traditional layer-wise freezing approaches, SemifreddoNets make a profitable trade-off between cost and flexibility by having some of the weights configurable at different scales and levels of abstraction in the model. Although fixing the topology and some of the weights somewhat limits the flexibility, we argue that the efficiency benefits of this strategy outweigh the advantages of a fully configurable model for many use cases. Furthermore, our system uses repeatable blocks, so it has the flexibility to adjust model complexity without requiring any hardware change. The hardware implementation of SemifreddoNets provides up to an order of magnitude reduction in silicon area and power consumption as compared to an equivalent implementation on a general-purpose accelerator.

23.Rethinking Pre-training and Self-training ⬇️

Pre-training is a dominant paradigm in computer vision. For example, supervised ImageNet pre-training is commonly used to initialize the backbones of object detection and segmentation models. He et al., however, show a surprising result: ImageNet pre-training has limited impact on COCO object detection. Here we investigate self-training as another method to utilize additional data on the same setup and contrast it against ImageNet pre-training. Our study reveals the generality and flexibility of self-training with three additional insights: 1) stronger data augmentation and more labeled data further diminish the value of pre-training, 2) unlike pre-training, self-training is always helpful when using stronger data augmentation, in both low-data and high-data regimes, and 3) in cases where pre-training is helpful, self-training improves upon pre-training. For example, on the COCO object detection dataset, pre-training is beneficial when we use one-fifth of the labeled data, and hurts accuracy when we use all labeled data. Self-training, on the other hand, shows positive improvements from +1.3 to +3.4 AP across all dataset sizes. In other words, self-training works well exactly in the setup where pre-training does not (using ImageNet to help COCO). On the PASCAL segmentation dataset, which is much smaller than COCO, pre-training does help significantly, but self-training still improves upon the pre-trained model. On COCO object detection, we achieve 54.3 AP, an improvement of +1.5 AP over the strongest SpineNet model. On PASCAL segmentation, we achieve 90.5 mIOU, an improvement of +1.5% mIOU over the previous state-of-the-art result by DeepLabv3+.

24.Feudal Steering: Hierarchical Learning for Steering Angle Prediction ⬇️

We consider the challenge of automated steering angle prediction for self-driving cars using egocentric road images. In this work, we explore the use of feudal networks, used in hierarchical reinforcement learning (HRL), to devise a vehicle agent that predicts steering angles from first-person, dash-cam images of the Udacity driving dataset. Our method, Feudal Steering, is inspired by recent work in HRL consisting of a manager network and a worker network that operate on different temporal scales and have different goals. The manager works at a temporal scale that is relatively coarse compared to the worker and has a higher-level, task-oriented goal space. Using feudal learning to divide the task into manager and worker sub-networks provides more accurate and robust prediction. Temporal abstraction in driving allows more complex primitives than the steering angle at a single time instance. Composite actions comprise a subroutine or skill that can be re-used throughout the driving sequence. The associated subroutine id is the manager network's goal, so that the manager seeks to succeed at the high-level task (e.g. a sharp right turn, a slight right turn, moving straight in traffic, or moving straight unencumbered by traffic). The steering angle at a particular time instance is the worker network's output, which is regulated by the manager's high-level task. We demonstrate state-of-the-art steering angle prediction results on the Udacity dataset.

25.SegNBDT: Visual Decision Rules for Segmentation ⬇️

The black-box nature of neural networks limits model decision interpretability, in particular for high-dimensional inputs in computer vision and for dense pixel prediction tasks like segmentation. To address this, prior work combines neural networks with decision trees. However, such models (1) perform poorly when compared to state-of-the-art segmentation models or (2) fail to produce decision rules with spatially-grounded semantic meaning. In this work, we build a hybrid neural-network and decision-tree model for segmentation that (1) attains neural network segmentation accuracy and (2) provides semi-automatically constructed visual decision rules such as "Is there a window?". We obtain semantic visual meaning by extending saliency methods to segmentation and attain accuracy by leveraging insights from neural-backed decision trees, a deep learning analog of decision trees for image classification. Our model SegNBDT attains accuracy within ~2-4% of the state-of-the-art HRNetV2 segmentation model while also retaining explainability; we achieve state-of-the-art performance for explainable models on three benchmark datasets -- Pascal-Context (49.12%), Cityscapes (79.01%), and Look Into Person (51.64%). Furthermore, user studies suggest visual decision rules are more interpretable, particularly for incorrect predictions. Code and pretrained models can be found at this https URL.

26.On Improving the Generalization of Face Recognition in the Presence of Occlusions ⬇️

In this paper, we address a key limitation of existing 2D face recognition methods: robustness to occlusions. To accomplish this task, we systematically analyzed the impact of facial attributes on the performance of a state-of-the-art face recognition method and, through extensive experimentation, quantitatively analyzed the performance degradation under different types of occlusion. Our proposed Occlusion-aware face REcOgnition (OREO) approach learned discriminative facial templates despite the presence of such occlusions. First, an attention mechanism was proposed that extracted local identity-related regions. The local features were then aggregated with the global representations to form a single template. Second, a simple, yet effective, training strategy was introduced to balance the non-occluded and occluded facial images. Extensive experiments demonstrated that OREO improved the generalization ability of face recognition under occlusions by 10.17% in a single-image-based setting and outperformed the baseline by approximately 2% in terms of rank-1 accuracy in an image-set-based scenario.

27.On Improving Temporal Consistency for Online Face Liveness Detection ⬇️

In this paper, we focus on improving the online face liveness detection system to enhance the security of the downstream face recognition system. Most existing frame-based methods suffer from prediction inconsistency across time. To address the issue, a simple yet effective solution based on temporal consistency is proposed. Specifically, in the training stage, to integrate the temporal consistency constraint, a temporal self-supervision loss and a class consistency loss are proposed in addition to the softmax cross-entropy loss. In the deployment stage, a training-free non-parametric uncertainty estimation module is developed to smooth the predictions adaptively. Beyond the common evaluation approach, a video segment-based evaluation is proposed to accommodate more practical scenarios. Extensive experiments demonstrated that our solution is more robust against several presentation attacks in various scenarios, and significantly outperformed the state-of-the-art on multiple public datasets by at least 40% in terms of ACER. Besides, with much less computational complexity (33% fewer FLOPs), it provides great potential for low-latency online applications.

28.PRGFlow: Benchmarking SWAP-Aware Unified Deep Visual Inertial Odometry ⬇️

Odometry on aerial robots has to be of low latency and high robustness whilst also respecting the Size, Weight, Area and Power (SWAP) constraints demanded by the size of the robot. A combination of visual sensors coupled with Inertial Measurement Units (IMUs) has proven to be the best combination for obtaining robust and low-latency odometry on resource-constrained aerial robots. Recently, deep learning approaches for visual-inertial fusion have gained momentum due to their high accuracy and robustness. However, two remarkable benefits have been lacking from previous approaches: inherent scalability (adaptation to different-sized aerial robots) and unification (the same method works on different-sized aerial robots), achievable by utilizing compression methods and hardware acceleration.
To this end, we present a deep learning approach for visual translation estimation and loosely fuse it with an Inertial sensor for full 6DoF odometry estimation. We also present a detailed benchmark comparing different architectures, loss functions and compression methods to enable scalability. We evaluate our network on the MSCOCO dataset and evaluate the VI fusion on multiple real-flight trajectories.

29.An Unsupervised Information-Theoretic Perceptual Quality Metric ⬇️

Tractable models of human perception have proved to be challenging to build. Hand-designed models such as MS-SSIM remain popular predictors of human image quality judgements due to their simplicity and speed. Recent modern deep learning approaches can perform better, but they rely on supervised data which can be costly to gather: large sets of class labels such as ImageNet, image quality ratings, or both. We combine recent advances in information-theoretic objective functions with a computational architecture informed by the physiology of the human visual system and unsupervised training on pairs of video frames, yielding our Perceptual Information Metric (PIM). We show that PIM is competitive with supervised metrics on the recent and challenging BAPPS image quality assessment dataset. We also perform qualitative experiments using the ImageNet-C dataset, and establish that our approach is robust with respect to architectural details.

30.Deep Convolutional Likelihood Particle Filter for Visual Tracking ⬇️

We propose a novel particle filter for convolutional-correlation visual trackers. Our method uses correlation response maps to estimate likelihood distributions and employs these likelihoods as proposal densities to sample particles. Likelihood distributions are more reliable than proposal densities based on target transition distributions because correlation response maps provide additional information regarding the target's location. Additionally, our particle filter searches for multiple modes in the likelihood distribution, which improves performance in target occlusion scenarios while decreasing computational costs by more efficiently sampling particles. In other challenging scenarios such as those involving motion blur, where only one mode is present but a larger search area may be necessary, our particle filter allows for the variance of the likelihood distribution to increase. We tested our algorithm on the Visual Tracker Benchmark v1.1 (OTB100) and our experimental results demonstrate that our framework outperforms state-of-the-art methods.
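
A sketch of the core sampling step under our assumptions: treat the non-negative correlation response map as an unnormalized likelihood over positions and draw particles from it, with jitter whose variance can be increased in diffuse cases such as motion blur:

```python
import numpy as np

def sample_particles(response, num_particles=100, jitter_std=1.0):
    """response: (H, W) correlation response map used as a proposal density."""
    h, w = response.shape
    p = np.clip(response, 0, None).ravel()
    p = p / p.sum()
    idx = np.random.choice(h * w, size=num_particles, p=p)
    ys, xs = np.unravel_index(idx, (h, w))
    # jitter_std can grow when the response is diffuse (e.g. motion blur)
    xs = xs + np.random.randn(num_particles) * jitter_std
    ys = ys + np.random.randn(num_particles) * jitter_std
    return np.stack([xs, ys], axis=1)

particles = sample_particles(np.random.rand(50, 50))
```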

31.Gaze estimation problem tackled through synthetic images ⬇️

In this paper, we evaluate a synthetic framework to be used in the field of gaze estimation employing deep learning techniques. The lack of sufficient annotated data can be overcome by a synthetic evaluation framework, provided that it resembles the behavior of a real scenario. In this work, we use the U2Eyes synthetic environment with the I2Head dataset as the real benchmark for comparison, based on alternative training and testing strategies. The results obtained show comparable average behavior between both frameworks, although significantly more robust and stable performance is obtained with the synthetic images. Additionally, the potential of synthetically pretrained models for use in user-specific calibration strategies is shown, with outstanding performance.

32.Training Generative Adversarial Networks with Limited Data ⬇️

Training generative adversarial networks (GAN) using too little data typically leads to discriminator overfitting, causing training to diverge. We propose an adaptive discriminator augmentation mechanism that significantly stabilizes training in limited data regimes. The approach does not require changes to loss functions or network architectures, and is applicable both when training from scratch and when fine-tuning an existing GAN on another dataset. We demonstrate, on several datasets, that good results are now possible using only a few thousand training images, often matching StyleGAN2 results with an order of magnitude fewer images. We expect this to open up new application domains for GANs. We also find that the widely used CIFAR-10 is, in fact, a limited data benchmark, and improve the record FID from 5.59 to 2.67.
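
The adaptive mechanism can be summarized as a small control loop. In this hedged sketch, discriminator overfitting is estimated from the fraction of real images the discriminator scores as real, and the augmentation probability p is nudged toward a target value (the constants are illustrative, not the paper's exact settings):

```python
import torch

def update_augment_p(p, d_real_logits, target=0.6, step=0.005):
    """Nudge the augmentation probability p based on an overfitting heuristic."""
    r_t = (d_real_logits > 0).float().mean().item()  # overfitting indicator
    p = p + step if r_t > target else p - step
    return min(max(p, 0.0), 1.0)

p = 0.0
for _ in range(3):                       # inside the training loop
    d_real_logits = torch.randn(64)      # stand-in for D's outputs on reals
    p = update_augment_p(p, d_real_logits)
```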

33.Residual Force Control for Agile Human Behavior Imitation and Extended Motion Synthesis ⬇️

Reinforcement learning has shown great promise for synthesizing realistic human behaviors by learning humanoid control policies from motion capture data. However, it is still very challenging to reproduce sophisticated human skills like ballet dance, or to stably imitate long-term human behaviors with complex transitions. The main difficulty lies in the dynamics mismatch between the humanoid model and real humans. That is, motions of real humans may not be physically possible for the humanoid model. To overcome the dynamics mismatch, we propose a novel approach, residual force control (RFC), that augments a humanoid control policy by adding external residual forces into the action space. During training, the RFC-based policy learns to apply residual forces to the humanoid to compensate for the dynamics mismatch and better imitate the reference motion. Experiments on a wide range of dynamic motions demonstrate that our approach outperforms state-of-the-art methods in terms of convergence speed and the quality of learned motions. For the first time, we show a physics-based virtual character performing highly agile ballet dance moves such as pirouette, arabesque and jeté. Furthermore, we propose a dual-policy control framework, where a kinematic policy and an RFC-based policy work in tandem to synthesize multi-modal infinite-horizon human motions without any task guidance or user input. Our approach is the first humanoid control method that successfully learns from a large-scale human motion dataset (Human3.6M) and generates diverse long-term motions.

34.CPR: Classifier-Projection Regularization for Continual Learning ⬇️

We propose a general, yet simple patch that can be applied to existing regularization-based continual learning methods, called classifier-projection regularization (CPR). Inspired by both recent results on neural networks with wide local minima and information theory, CPR adds an additional regularization term that maximizes the entropy of a classifier's output probability. We demonstrate that this additional term can be interpreted as a projection of the conditional probability given by a classifier's output onto the uniform distribution. By applying the Pythagorean theorem for KL divergence, we then prove that this projection may (in theory) improve the performance of continual learning methods. In our extensive experimental results, we apply CPR to several state-of-the-art regularization-based continual learning methods and benchmark performance on popular image recognition datasets. Our results demonstrate that CPR indeed promotes wide local minima and significantly improves both accuracy and plasticity while simultaneously mitigating the catastrophic forgetting of baseline continual learning methods.
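
CPR amounts to one extra term: subtracting the classifier's output entropy from the task loss, which pulls the predictive distribution toward uniform (the projection view above). A minimal sketch with an illustrative coefficient beta:

```python
import torch
import torch.nn.functional as F

def cpr_loss(logits, targets, beta=0.1):
    """Task loss minus beta * entropy; subtracting entropy maximizes it."""
    ce = F.cross_entropy(logits, targets)
    p = F.softmax(logits, dim=1)
    entropy = -(p * torch.log(p.clamp_min(1e-8))).sum(dim=1).mean()
    return ce - beta * entropy

loss = cpr_loss(torch.randn(8, 10), torch.randint(0, 10, (8,)))
```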

35.FedGAN: Federated Generative Adversarial Networks for Distributed Data ⬇️

We propose the Federated Generative Adversarial Network (FedGAN) for training a GAN across distributed sources of non-independent-and-identically-distributed data, subject to communication and privacy constraints. Our algorithm uses local generators and discriminators which are periodically synced via an intermediary that averages and broadcasts the generator and discriminator parameters. We theoretically prove the convergence of FedGAN with both equal and two time-scale updates of generator and discriminator, under standard assumptions, using stochastic approximations and communication-efficient stochastic gradient descent. We experiment with FedGAN on toy examples (2D systems, mixed Gaussians, and the Swiss roll), image datasets (MNIST, CIFAR-10, and CelebA), and time series datasets (household electricity consumption and electric vehicle charging sessions). We show that FedGAN converges and has performance similar to a general distributed GAN, while reducing communication complexity. We also show its robustness to reduced communications.
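
The synchronization step is FedAvg-style parameter averaging by the intermediary; a minimal sketch for one family of networks (generators or discriminators), assuming floating-point parameters:

```python
import copy
import torch

def sync_agents(models):
    """Average parameters across local agents and broadcast the result."""
    avg = copy.deepcopy(models[0].state_dict())
    for key in avg:
        avg[key] = torch.stack([m.state_dict()[key] for m in models]).mean(0)
    for m in models:
        m.load_state_dict(avg)

agents = [torch.nn.Linear(4, 2) for _ in range(3)]  # stand-ins for local generators
sync_agents(agents)  # called periodically, every K local training steps
```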

36.Sparse and Continuous Attention Mechanisms ⬇️

Exponential families are widely used in machine learning; they include many distributions in continuous and discrete domains (e.g., Gaussian, Dirichlet, Poisson, and categorical distributions via the softmax transformation). Distributions in each of these families have fixed support. In contrast, for finite domains, there has been recent work on sparse alternatives to softmax (e.g. sparsemax and alpha-entmax), which have varying support, being able to assign zero probability to irrelevant categories. This paper expands that work in two directions: first, we extend alpha-entmax to continuous domains, revealing a link with Tsallis statistics and deformed exponential families. Second, we introduce continuous-domain attention mechanisms, deriving efficient gradient backpropagation algorithms for alpha in {1,2}. Experiments on attention-based text classification, machine translation, and visual question answering illustrate the use of continuous attention in 1D and 2D, showing that it allows attending to time intervals and compact regions.
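
For alpha = 2, entmax coincides with sparsemax, the Euclidean projection of the scores onto the probability simplex; a self-contained PyTorch sketch of the finite-domain case:

```python
import torch

def sparsemax(z):
    """Sparsemax over the last dim; can assign exact zero probability."""
    z_sorted, _ = torch.sort(z, dim=-1, descending=True)
    k = torch.arange(1, z.shape[-1] + 1, device=z.device, dtype=z.dtype)
    cumsum = z_sorted.cumsum(dim=-1)
    support = 1 + k * z_sorted > cumsum           # entries kept in the support
    k_z = support.sum(dim=-1, keepdim=True)       # support size
    tau = (cumsum.gather(-1, k_z - 1) - 1) / k_z  # threshold
    return torch.clamp(z - tau, min=0)

print(sparsemax(torch.tensor([2.0, 1.0, -1.0])))  # tensor([1., 0., 0.])
```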

37.HMIC: Hierarchical Medical Image Classification, A Deep Learning Approach ⬇️

Image classification is central to the big data revolution in medicine. Improved information processing methods for diagnosis and classification of digital medical images have proven successful via deep learning approaches. As this field is explored, there are limitations to the performance of traditional supervised classifiers. This paper outlines an approach that differs from current medical image classification tasks that view the issue as multi-class classification. We perform hierarchical classification using our Hierarchical Medical Image Classification (HMIC) approach. HMIC uses stacks of deep learning models to give particular comprehension at each level of the clinical picture hierarchy. For testing our performance, we use small bowel biopsy images that contain three categories at the parent level (Celiac Disease, Environmental Enteropathy, and histologically normal controls). At the child level, Celiac Disease severity is classified into four classes (I, IIIa, IIIb, and IIIc).

38.Move-to-Data: A new Continual Learning approach with Deep CNNs, Application for image-class recognition ⬇️

In many real-life applications of supervised learning, not all training data are available at the same time. Examples include lifelong image classification, recognition of environmental objects during the interaction of instrumented persons with their environment, and enrichment of an online database with more images. It is necessary to pre-train the model in a "training recording phase" and then adjust it to the newly arriving data. This is the task of incremental/continual learning approaches. Among the different problems to be solved by these approaches, such as the introduction of new categories into the model or the refinement of existing categories into sub-categories with the extension of trained classifiers over them, we focus on the problem of adjusting a pre-trained model with new additional training data for existing categories. We propose a fast continual learning layer at the end of the neural network. The obtained results are illustrated on the open-source CIFAR benchmark dataset. The proposed scheme yields performance similar to retraining but with drastically lower computational cost.

39.Non-Negative Bregman Divergence Minimization for Deep Direct Density Ratio Estimation ⬇️

The estimation of the ratio of two probability densities has garnered attention as the density ratio is useful in various machine learning tasks, such as anomaly detection and domain adaptation. To estimate the density ratio, methods collectively known as direct density ratio estimation (DRE) have been explored. These methods are based on the minimization of the Bregman (BR) divergence between a density ratio model and the true density ratio. However, existing direct DRE suffers from serious overfitting when using flexible models such as neural networks. In this paper, we introduce a non-negative correction for empirical risk using only the prior knowledge of the upper bound of the density ratio. This correction makes a DRE method more robust against overfitting and enables the use of flexible models. In the theoretical analysis, we discuss the consistency of the empirical risk. In our experiments, the proposed estimators show favorable performance in inlier-based outlier detection and covariate shift adaptation.

40.Early Detection of Retinopathy of Prematurity (ROP) in Retinal Fundus Images Via Convolutional Neural Networks ⬇️

Retinopathy of prematurity (ROP) is an abnormal blood vessel development in the retina of a prematurely-born infant or an infant with low birth weight. ROP is one of the leading causes for infant blindness globally. Early detection of ROP is critical to slow down and avert the progression to vision impairment caused by ROP. Yet there is limited awareness of ROP even among medical professionals. Consequently, dataset for ROP is limited if ever available, and is in general extremely imbalanced in terms of the ratio between negative images and positive ones. In this study, we formulate the problem of detecting ROP in retinal fundus images in an optimization framework, and apply state-of-art convolutional neural network techniques to solve this problem. Experimental results based on our models achieve 100 percent sensitivity, 96 percent specificity, 98 percent accuracy, and 96 percent precision. In addition, our study shows that as the network gets deeper, more significant features can be extracted for better understanding of ROP.

41.LSSL: Longitudinal Self-Supervised Learning ⬇️

Longitudinal neuroimaging or biomedical studies often acquire multiple observations from each individual over time, which entails repeated measures with highly interdependent variables. In this paper, we discuss the implication of repeated measures design on unsupervised learning by showing its tight conceptual connection to self-supervised learning and factor disentanglement. Leveraging the ability for 'self-comparison' through repeated measures, we explicitly separate the definition of the factor space and the representation space, enabling an exact disentanglement of time-related factors from the representations of the images. By formulating deterministic multivariate mapping functions between the two spaces, our model, named Longitudinal Self-Supervised Learning (LSSL), uses a standard autoencoding structure with a cosine loss to estimate the direction linked to the disentangled factor. We apply LSSL to two longitudinal neuroimaging studies to show its unique advantage in extracting the 'brain-age' information from the data and in revealing informative characteristics associated with neurodegenerative and neuropsychological disorders. For a downstream task of supervised diagnosis classification, the representations learned by LSSL permit faster convergence and higher (or similar) prediction accuracy compared to several other representation learning techniques.

42.Potential Field Guided Actor-Critic Reinforcement Learning ⬇️

In this paper, we consider the problem of actor-critic reinforcement learning. First, we extend the actor-critic architecture to an actor-critic-N architecture by introducing more critics beyond rewards. Second, we combine the reward-based critic with a potential-field-based critic to formulate the proposed potential field guided actor-critic reinforcement learning approach (actor-critic-2). This can be seen as a combination of model-based gradients and model-free gradients in policy improvement. A state with a large potential field often carries strong prior information, such as pointing to a distant target or avoiding collision alongside an obstacle. In this situation, we should trust the potential-field-based critic more for policy evaluation in order to accelerate policy improvement, and the action policy tends to be guided. For example, in practical applications, learning to avoid obstacles should be guided rather than learned by trial and error. A state with a small potential field often lacks information, for example at a local minimum point or around a moving target. In this case, we should trust the reward-based critic more for policy evaluation of the long-term return, and the action policy tends to explore. In addition, potential field evaluation can be combined with planning to estimate a better state value function. In this way, reward design can focus more on the final-stage reward rather than on reward shaping or phased rewards. Furthermore, potential field evaluation can make up for the lack of communication in multi-agent cooperation problems, i.e., each agent has a reward-based critic and a relatively unified potential-field-based critic with prior information. Third, simplified experiments on a predator-prey game demonstrate the effectiveness of the proposed approach.

43.Online Sequential Extreme Learning Machines: Features Combined From Hundreds of Midlayers ⬇️

In this paper, we develop a hierarchical online sequential learning algorithm (H-OS-ELM) for a single-hidden-layer feedforward network with features combined from hundreds of midlayers. The algorithm can learn chunk by chunk with fixed or varying block size. We believe that the diverse selectivity of neurons in the top layers, which encode distributed information produced by the other neurons, offers a better trade-off between computational cost and inference accuracy. This paper therefore proposes a hierarchical model framework combined with an online sequential learning algorithm. First, the model consists of a subspace feature extractor built from subnetwork neurons; using the sub-features produced by the first layer of the hierarchy, we discard irrelevant factors that are of no use for learning, and we iterate this process to recast the sub-features into the hierarchical model for more refined representations. Second, by using OS-ELM we adopt a non-iterative learning style and implement a network that is wide and shallow, which plays an important role in generalizing overall performance and in turn boosts learning speed.

44.Reintroducing Straight-Through Estimators as Principled Methods for Stochastic Binary Networks ⬇️

Training neural networks with binary weights and activations is a challenging problem due to the lack of gradients and the difficulty of optimization over discrete weights. Many successful experimental results have recently been achieved using the empirical straight-through estimation approach. This approach has generated a variety of ad-hoc rules for propagating gradients through non-differentiable activations and updating discrete weights. We put such methods on a solid basis by obtaining them as viable approximations in the stochastic binary network (SBN) model with Bernoulli weights. In this model gradients are well-defined and the weight probabilities can be optimized by continuous techniques. By choosing the activation noises in SBN appropriately and choosing mirror descent (MD) for optimization, we obtain methods that closely resemble several existing straight-through variants, but unlike them, all work reliably and produce equally good results. We further show that variational inference for Bayesian learning of binary weights can be implemented using MD updates with the same simplicity.
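
One canonical straight-through rule of the kind the paper derives principled counterparts for: sign() in the forward pass, and a clipped identity gradient in the backward pass. The clipping variant shown is a common empirical choice, not necessarily the paper's derivation:

```python
import torch

class BinarizeSTE(torch.autograd.Function):
    """Forward: sign(x). Backward: pass the gradient only where |x| <= 1."""
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.sign(x)

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        return grad_out * (x.abs() <= 1).float()

x = torch.randn(4, requires_grad=True)
BinarizeSTE.apply(x).sum().backward()   # x.grad is the clipped identity
```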

45.Combining the band-limited parameterization and Semi-Lagrangian Runge-Kutta integration for efficient PDE-constrained LDDMM ⬇️

The family of PDE-constrained LDDMM methods is emerging as a particularly interesting approach for physically meaningful diffeomorphic transformations. The original combination of Gauss-Newton-Krylov optimization and Runge-Kutta integration shows excellent numerical accuracy and a fast convergence rate. However, its most significant limitation is its huge computational complexity, hindering its extensive use in applied Computational Anatomy studies. This limitation has been treated independently by problem formulation in the space of band-limited vector fields and by Semi-Lagrangian integration. The purpose of this work is to combine both in three variants of band-limited PDE-constrained LDDMM to further increase their computational efficiency. The accuracy of the resulting methods is evaluated extensively. For all the variants, the proposed combined approach shows a significant increase in computational efficiency. In addition, the variant based on the deformation state equation is consistently positioned as the best-performing method across all the evaluation frameworks in terms of accuracy and efficiency.

46.Automated Identification of Thoracic Pathology from Chest Radiographs with Enhanced Training Pipeline ⬇️

Chest x-rays are the most common radiology studies for diagnosing lung and heart disease. Hence, a system for automated pre-reporting of pathologic findings on chest x-rays would greatly enhance radiologists' productivity. To this end, we investigate a deep-learning framework with novel training schemes for classification of different thoracic pathology labels from chest x-rays. We use the currently largest publicly available annotated dataset, ChestX-ray14, of 112,120 chest radiographs of 30,805 patients. Each image was annotated with either a 'NoFinding' class, or one or more of 14 thoracic pathology labels. Subjects can have multiple pathologies, resulting in a multi-class, multi-label problem. We encoded labels as binary vectors using k-hot encoding. We study the ResNet34 architecture, pre-trained on ImageNet, where two key modifications were incorporated into the training framework: (1) stochastic gradient descent with momentum and with restarts using cosine annealing, and (2) variable image sizes for fine-tuning to prevent overfitting. Additionally, we use a heuristic algorithm to select a good learning rate. Learning with restarts was used to avoid local minima. The Area Under the receiver operating characteristic Curve (AUC) was used to quantitatively evaluate diagnostic quality. Our results are comparable to, or outperform, the best results of current state-of-the-art methods, with AUCs as follows: Atelectasis: 0.81, Cardiomegaly: 0.91, Consolidation: 0.81, Edema: 0.92, Effusion: 0.89, Emphysema: 0.92, Fibrosis: 0.81, Hernia: 0.84, Infiltration: 0.73, Mass: 0.85, Nodule: 0.76, Pleural Thickening: 0.81, Pneumonia: 0.77, Pneumothorax: 0.89, and NoFinding: 0.79. Our results suggest that, in addition to using sophisticated network architectures, a good learning rate, a scheduler, and a robust optimizer can boost performance.
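
Two of the training-pipeline ingredients are easy to make concrete: k-hot encoding of the multi-label targets, and an SGDR-style cosine learning-rate schedule with restarts. A sketch with illustrative constants:

```python
import numpy as np

def k_hot(label_indices, num_classes=15):
    """Encode a set of pathology labels as a binary (k-hot) vector."""
    v = np.zeros(num_classes, dtype=np.float32)
    v[label_indices] = 1.0
    return v

def cosine_lr_with_restarts(step, cycle_len=1000, lr_max=1e-3, lr_min=1e-6):
    """Cosine decay within each cycle; the cycle boundary acts as a restart."""
    t = (step % cycle_len) / cycle_len
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + np.cos(np.pi * t))

print(k_hot([1, 4]))                 # e.g. two of the 14 pathologies present
print(cosine_lr_with_restarts(999))  # near lr_min just before a restart
```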

47.Multigrid-in-Channels Architectures for Wide Convolutional Neural Networks ⬇️

We present a multigrid approach that combats the quadratic growth of the number of parameters with respect to the number of channels in standard convolutional neural networks (CNNs). It has been shown that there is a redundancy in standard CNNs, as networks with much sparser convolution operators can yield similar performance to full networks. The sparsity patterns that lead to such behavior, however, are typically random, hampering hardware efficiency. In this work, we present a multigrid-in-channels approach for building CNN architectures that achieves full coupling of the channels, and whose number of parameters is linearly proportional to the width of the network. To this end, we replace each convolution layer in a generic CNN with a multilevel layer consisting of structured (i.e., grouped) convolutions. Our examples from supervised image classification show that applying this strategy to residual networks and MobileNetV2 considerably reduces the number of parameters without negatively affecting accuracy. Therefore, we can widen networks without dramatically increasing the number of parameters or operations.
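
The parameter saving from structured (grouped) convolutions is easy to verify in PyTorch; the group size below is our choice, and the multigrid levels that restore full cross-group coupling are omitted:

```python
import torch.nn as nn

c = 256
full = nn.Conv2d(c, c, kernel_size=3, padding=1)                # dense channel coupling
grouped = nn.Conv2d(c, c, kernel_size=3, padding=1, groups=32)  # 8-channel groups

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(full), count(grouped))  # grouped uses ~32x fewer conv weights
```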

48.One Ring to Rule Them All: Certifiably Robust Geometric Perception with Outliers ⬇️

We propose a general and practical framework to design certifiable algorithms for robust geometric perception in the presence of a large amount of outliers. We investigate the use of a truncated least squares (TLS) cost function, which is known to be robust to outliers, but leads to hard, nonconvex, and nonsmooth optimization problems. Our first contribution is to show that, for a broad class of geometric perception problems, TLS estimation can be reformulated as an optimization over the ring of polynomials, and that Lasserre's hierarchy of convex moment relaxations is empirically tight at the minimum relaxation order (i.e., it certifiably obtains the global minimum of the nonconvex TLS problem). Our second contribution is to exploit the structural sparsity of the objective and constraint polynomials and leverage basis reduction to significantly reduce the size of the semidefinite program (SDP) resulting from the moment relaxation, without compromising its tightness. Our third contribution is to develop scalable dual optimality certifiers from the lens of sums-of-squares (SOS) relaxation, which can compute the suboptimality gap and possibly certify global optimality of any candidate solution (e.g., returned by fast heuristics such as RANSAC or graduated non-convexity). Our dual certifiers leverage Douglas-Rachford splitting to solve a convex feasibility SDP. Numerical experiments across different perception problems, including high-integrity satellite pose estimation, demonstrate the tightness of our relaxations, the correctness of the certification, and the scalability of the proposed dual certifiers to large problems, beyond the reach of current SDP solvers.
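
For reference, the truncated least squares cost referred to above typically takes the form below, where any residual beyond the threshold contributes only a constant penalty, capping the influence of outliers:

```latex
% r_i(x): residual of the i-th measurement under model parameters x
% \bar{c}: truncation threshold separating inliers from outliers
\hat{x} = \operatorname*{arg\,min}_{x} \sum_{i=1}^{N} \min\left( r_i^2(x),\; \bar{c}^2 \right)
```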

49.Data Driven Prediction Architecture for Autonomous Driving and its Application on Apollo Platform ⬇️

Autonomous driving vehicles (ADVs) are on the road at large scale. For safe and efficient operation, ADVs must be able to predict the future states of, and interact with, road entities in complex, real-world driving scenarios. How to migrate a well-trained prediction model from one geo-fenced area to another is essential for scaling ADV operation, and is difficult most of the time since the terrains, traffic rules, entity distributions, and driving/walking patterns can be largely different between geo-fenced operation areas. In this paper, we introduce a highly automated learning-based prediction model pipeline, deployed on the Baidu Apollo self-driving platform, to support the data annotation, feature extraction, model training/tuning, and deployment of the different prediction learning sub-modules. This pipeline is completely automatic, without any human intervention, and shows up to a 400% efficiency increase in parameter tuning when deployed at scale in different scenarios across nations.