Skip to content

Latest commit

 

History

History
219 lines (219 loc) · 148 KB

20201020.md

File metadata and controls

219 lines (219 loc) · 148 KB

ArXiv cs.CV --Tue, 20 Oct 2020

1.PseudoSeg: Designing Pseudo Labels for Semantic Segmentation ⬇️

Recent advances in semi-supervised learning (SSL) demonstrate that a combination of consistency regularization and pseudo-labeling can effectively improve image classification accuracy in the low-data regime. Compared to classification, semantic segmentation tasks require much more intensive labeling costs. Thus, these tasks greatly benefit from data-efficient training methods. However, structured outputs in segmentation render particular difficulties (e.g., designing pseudo-labeling and augmentation) to apply existing SSL strategies. To address this problem, we present a simple and novel re-design of pseudo-labeling to generate well-calibrated structured pseudo labels for training with unlabeled or weakly-labeled data. Our proposed pseudo-labeling strategy is network structure agnostic to apply in a one-stage consistency training framework. We demonstrate the effectiveness of the proposed pseudo-labeling strategy in both low-data and high-data regimes. Extensive experiments have validated that pseudo labels generated from wisely fusing diverse sources and strong data augmentation are crucial to consistency training for segmentation. The source code is available at this https URL.

2.Self-supervised Co-training for Video Representation Learning ⬇️

The objective of this paper is visual-only self-supervised video representation learning. We make the following contributions: (i) we investigate the benefit of adding semantic-class positives to instance-based Info Noise Contrastive Estimation (InfoNCE) training, showing that this form of supervised contrastive learning leads to a clear improvement in performance; (ii) we propose a novel self-supervised co-training scheme to improve the popular infoNCE loss, exploiting the complementary information from different views, RGB streams and optical flow, of the same data source by using one view to obtain positive class samples for the other; (iii) we thoroughly evaluate the quality of the learnt representation on two different downstream tasks: action recognition and video retrieval. In both cases, the proposed approach demonstrates state-of-the-art or comparable performance with other self-supervised approaches, whilst being significantly more efficient to train, i.e. requiring far less training data to achieve similar performance.

3.Multiple Pedestrians and Vehicles Tracking in Aerial Imagery: A Comprehensive Study ⬇️

In this paper, we address various challenges in multi-pedestrian and vehicle tracking in high-resolution aerial imagery by intensive evaluation of a number of traditional and Deep Learning based Single- and Multi-Object Tracking methods. We also describe our proposed Deep Learning based Multi-Object Tracking method AerialMPTNet that fuses appearance, temporal, and graphical information using a Siamese Neural Network, a Long Short-Term Memory, and a Graph Convolutional Neural Network module for a more accurate and stable tracking. Moreover, we investigate the influence of the Squeeze-and-Excitation layers and Online Hard Example Mining on the performance of AerialMPTNet. To the best of our knowledge, we are the first in using these two for a regression-based Multi-Object Tracking. Additionally, we studied and compared the L1 and Huber loss functions. In our experiments, we extensively evaluate AerialMPTNet on three aerial Multi-Object Tracking datasets, namely AerialMPT and KIT AIS pedestrian and vehicle datasets. Qualitative and quantitative results show that AerialMPTNet outperforms all previous methods for the pedestrian datasets and achieves competitive results for the vehicle dataset. In addition, Long Short-Term Memory and Graph Convolutional Neural Network modules enhance the tracking performance. Moreover, using Squeeze-and-Excitation and Online Hard Example Mining significantly helps for some cases while degrades the results for other cases. In addition, according to the results, L1 yields better results with respect to Huber loss for most of the scenarios. The presented results provide a deep insight into challenges and opportunities of the aerial Multi-Object Tracking domain, paving the way for future research.

4.Detecting Hands and Recognizing Physical Contact in the Wild ⬇️

We investigate a new problem of detecting hands and recognizing their physical contact state in unconstrained conditions. This is a challenging inference task given the need to reason beyond the local appearance of hands. The lack of training annotations indicating which object or parts of an object the hand is in contact with further complicates the task. We propose a novel convolutional network based on Mask-RCNN that can jointly learn to localize hands and predict their physical contact to address this problem. The network uses outputs from another object detector to obtain locations of objects present in the scene. It uses these outputs and hand locations to recognize the hand's contact state using two attention mechanisms. The first attention mechanism is based on the hand and a region's affinity, enclosing the hand and the object, and densely pools features from this region to the hand region. The second attention module adaptively selects salient features from this plausible region of contact. To develop and evaluate our method's performance, we introduce a large-scale dataset called ContactHands, containing unconstrained images annotated with hand locations and contact states. The proposed network, including the parameters of attention modules, is end-to-end trainable. This network achieves approximately 7% relative improvement over a baseline network that was built on the vanilla Mask-RCNN architecture and trained for recognizing hand contact states.

5.Multi-Stage Fusion for One-Click Segmentation ⬇️

Segmenting objects of interest in an image is an essential building block of applications such as photo-editing and image analysis. Under interactive settings, one should achieve good segmentations while minimizing user input. Current deep learning-based interactive segmentation approaches use early fusion and incorporate user cues at the image input layer. Since segmentation CNNs have many layers, early fusion may weaken the influence of user interactions on the final prediction results. As such, we propose a new multi-stage guidance framework for interactive segmentation. By incorporating user cues at different stages of the network, we allow user interactions to impact the final segmentation output in a more direct way. Our proposed framework has a negligible increase in parameter count compared to early-fusion frameworks. We perform extensive experimentation on the standard interactive instance segmentation and one-click segmentation benchmarks and report state-of-the-art performance.

6.Attention Augmented ConvLSTM forEnvironment Prediction ⬇️

Safe and proactive planning in robotic systems generally requires accurate predictions of the environment. Prior work on environment prediction applied video frame prediction techniques to bird's-eye view environment representations, such as occupancy grids. ConvLSTM-based frameworks used previously often result in significant blurring and vanishing of moving objects, thus hindering their applicability for use in safety-critical applications. In this work, we propose two extensions to the ConvLSTM to address these issues. We present the Temporal Attention Augmented ConvLSTM (TAAConvLSTM) and Self-Attention Augmented ConvLSTM (SAAConvLSTM) frameworks for spatiotemporal occupancy prediction, and demonstrate improved performance over baseline architectures on the real-world KITTI and Waymo datasets.

7.Multi-Modal Super Resolution for Dense Microscopic Particle Size Estimation ⬇️

Particle Size Analysis (PSA) is an important process carried out in a number of industries, which can significantly influence the properties of the final product. A ubiquitous instrument for this purpose is the Optical Microscope (OM). However, OMs are often prone to drawbacks like low resolution, small focal depth, and edge features being masked due to diffraction. We propose a powerful application of a combination of two Conditional Generative Adversarial Networks (cGANs) that Super Resolve OM images to look like Scanning Electron Microscope (SEM) images. We further demonstrate the use of a custom object detection module that can perform efficient PSA of the super-resolved particles on both, densely and sparsely packed images. The PSA results obtained from the super-resolved images have been benchmarked against human annotators, and results obtained from the corresponding SEM images. The proposed models show a generalizable way of multi-modal image translation and super-resolution for accurate particle size estimation.

8.Multiclass Wound Image Classification using an Ensemble Deep CNN-based Classifier ⬇️

Acute and chronic wounds are a challenge to healthcare systems around the world and affect many people's lives annually. Wound classification is a key step in wound diagnosis that would help clinicians to identify an optimal treatment procedure. Hence, having a high-performance classifier assists the specialists in the field to classify the wounds with less financial and time costs. Different machine learning and deep learning-based wound classification methods have been proposed in the literature. In this study, we have developed an ensemble Deep Convolutional Neural Network-based classifier to classify wound images including surgical, diabetic, and venous ulcers, into multi-classes. The output classification scores of two classifiers (patch-wise and image-wise) are fed into a Multi-Layer Perceptron to provide a superior classification performance. A 5-fold cross-validation approach is used to evaluate the proposed method. We obtained maximum and average classification accuracy values of 96.4% and 94.28% for binary and 91.9% and 87.7% for 3-class classification problems. The results show that our proposed method can be used effectively as a decision support system in classification of wound images or other related clinical applications.

9.Brain Atlas Guided Attention U-Net for White Matter Hyperintensity Segmentation ⬇️

White Matter Hyperintensities (WMH) are the most common manifestation of cerebral small vessel disease (cSVD) on the brain MRI. Accurate WMH segmentation algorithms are important to determine cSVD burden and its clinical consequences. Most of existing WMH segmentation algorithms require both fluid attenuated inversion recovery (FLAIR) images and T1-weighted images as inputs. However, T1-weighted images are typically not part of standard clinicalscans which are acquired for patients with acute stroke. In this paper, we propose a novel brain atlas guided attention U-Net (BAGAU-Net) that leverages only FLAIR images with a spatially-registered white matter (WM) brain atlas to yield competitive WMH segmentation performance. Specifically, we designed a dual-path segmentation model with two novel connecting mechanisms, namely multi-input attention module (MAM) and attention fusion module (AFM) to fuse the information from two paths for accurate results. Experiments on two publicly available datasets show the effectiveness of the proposed BAGAU-Net. With only FLAIR images and WM brain atlas, BAGAU-Net outperforms the state-of-the-art method with T1-weighted images, paving the way for effective development of WMH segmentation. Availability:this https URL

10.Learning to Reconstruct and Segment 3D Objects ⬇️

To endow machines with the ability to perceive the real-world in a three dimensional representation as we do as humans is a fundamental and long-standing topic in Artificial Intelligence. Given different types of visual inputs such as images or point clouds acquired by 2D/3D sensors, one important goal is to understand the geometric structure and semantics of the 3D environment. Traditional approaches usually leverage hand-crafted features to estimate the shape and semantics of objects or scenes. However, they are difficult to generalize to novel objects and scenarios, and struggle to overcome critical issues caused by visual occlusions. By contrast, we aim to understand scenes and the objects within them by learning general and robust representations using deep neural networks, trained on large-scale real-world 3D data. To achieve these aims, this thesis makes three core contributions from object-level 3D shape estimation from single or multiple views to scene-level semantic understanding.

11.Teacher-Student Competition for Unsupervised Domain Adaption ⬇️

With the supervision from source domain only in class-level, existing unsupervised domain adaption (UDA) methods mainly learn the domain-invariant representations from a shared feature extractor, which causes the source-bias problem. This paper proposes an unsupervised domain adaption approach with Teacher-Student Competition (TSC). In particular, a student network is introduced to learn the target-specific feature space, and we design a novel competition mechanism to select more credible pseudo-labels for the training of student network. We introduce a teacher network with the structure of existing conventional UDA method, and both teacher and student networks compete to provide target pseudo-labels to constrain every target sample's training in student network. Extensive experiments demonstrate that our proposed TSC framework significantly outperforms the state-of-the-art domain adaption methods on Office-31 and ImageCLEF-DA benchmarks.

12.On the Generalisation Capabilities of Fingerprint Presentation Attack Detection Methods in the Short Wave Infrared Domain ⬇️

Nowadays, fingerprint-based biometric recognition systems are becoming increasingly popular. However, in spite of their numerous advantages, biometric capture devices are usually exposed to the public and thus vulnerable to presentation attacks (PAs). Therefore, presentation attack detection (PAD) methods are of utmost importance in order to distinguish between bona fide and attack presentations. Due to the nearly unlimited possibilities to create new presentation attack instruments (PAIs), unknown attacks are a threat to existing PAD algorithms. This fact motivates research on generalisation capabilities in order to find PAD methods that are resilient to new attacks. In this context, we evaluate the generalisability of multiple PAD algorithms on a dataset of 19,711 bona fide and 4,339 PA samples, including 45 different PAI species. The PAD data is captured in the short wave infrared domain and the results discuss the advantages and drawbacks of this PAD technique regarding unknown attacks.

13.Domain Generalized Person Re-Identification via Cross-Domain Episodic Learning ⬇️

Aiming at recognizing images of the same person across distinct camera views, person re-identification (re-ID) has been among active research topics in computer vision. Most existing re-ID works require collection of a large amount of labeled image data from the scenes of interest. When the data to be recognized are different from the source-domain training ones, a number of domain adaptation approaches have been proposed. Nevertheless, one still needs to collect labeled or unlabelled target-domain data during training. In this paper, we tackle an even more challenging and practical setting, domain generalized (DG) person re-ID. That is, while a number of labeled source-domain datasets are available, we do not have access to any target-domain training data. In order to learn domain-invariant features without knowing the target domain of interest, we present an episodic learning scheme which advances meta learning strategies to exploit the observed source-domain labeled data. The learned features would exhibit sufficient domain-invariant properties while not overfitting the source-domain data or ID labels. Our experiments on four benchmark datasets confirm the superiority of our method over the state-of-the-arts.

14.A Versatile Crack Inspection Portable System based on Classifier Ensemble and Controlled Illumination ⬇️

This paper presents a novel setup for automatic visual inspection of cracks in ceramic tile as well as studies the effect of various classifiers and height-varying illumination conditions for this task. The intuition behind this setup is that cracks can be better visualized under specific lighting conditions than others. Our setup, which is designed for field work with constraints in its maximum dimensions, can acquire images for crack detection with multiple lighting conditions using the illumination sources placed at multiple heights. Crack detection is then performed by classifying patches extracted from the acquired images in a sliding window fashion. We study the effect of lights placed at various heights by training classifiers both on customized as well as state-of-the-art architectures and evaluate their performance both at patch-level and image-level, demonstrating the effectiveness of our setup. More importantly, ours is the first study that demonstrates how height-varying illumination conditions can affect crack detection with the use of existing state-of-the-art classifiers. We provide an insight about the illumination conditions that can help in improving crack detection in a challenging real-world industrial environment.

15.RONELD: Robust Neural Network Output Enhancement for Active Lane Detection ⬇️

Accurate lane detection is critical for navigation in autonomous vehicles, particularly the active lane which demarcates the single road space that the vehicle is currently traveling on. Recent state-of-the-art lane detection algorithms utilize convolutional neural networks (CNNs) to train deep learning models on popular benchmarks such as TuSimple and CULane. While each of these models works particularly well on train and test inputs obtained from the same dataset, the performance drops significantly on unseen datasets of different environments. In this paper, we present a real-time robust neural network output enhancement for active lane detection (RONELD) method to identify, track, and optimize active lanes from deep learning probability map outputs. We first adaptively extract lane points from the probability map outputs, followed by detecting curved and straight lanes before using weighted least squares linear regression on straight lanes to fix broken lane edges resulting from fragmentation of edge maps in real images. Lastly, we hypothesize true active lanes through tracking preceding frames. Experimental results demonstrate an up to two-fold increase in accuracy using RONELD on cross-dataset validation tests.

16.Weakly-supervised Learning For Catheter Segmentation in 3D Frustum Ultrasound ⬇️

Accurate and efficient catheter segmentation in 3D ultrasound (US) is essential for cardiac intervention. Currently, the state-of-the-art segmentation algorithms are based on convolutional neural networks (CNNs), which achieved remarkable performances in a standard Cartesian volumetric data. Nevertheless, these approaches suffer the challenges of low efficiency and GPU unfriendly image size. Therefore, such difficulties and expensive hardware requirements become a bottleneck to build accurate and efficient segmentation models for real clinical application. In this paper, we propose a novel Frustum ultrasound based catheter segmentation method. Specifically, Frustum ultrasound is a polar coordinate based image, which includes same information of standard Cartesian image but has much smaller size, which overcomes the bottleneck of efficiency than conventional Cartesian images. Nevertheless, the irregular and deformed Frustum images lead to more efforts for accurate voxel-level annotation. To address this limitation, a weakly supervised learning framework is proposed, which only needs 3D bounding box annotations overlaying the region-of-interest to training the CNNs. Although the bounding box annotation includes noise and inaccurate annotation to mislead to model, it is addressed by the proposed pseudo label generated scheme. The labels of training voxels are generated by incorporating class activation maps with line filtering, which is iteratively updated during the training. Our experimental results show the proposed method achieved the state-of-the-art performance with an efficiency of 0.25 second per volume. More crucially, the Frustum image segmentation provides a much faster and cheaper solution for segmentation in 3D US image, which meet the demands of clinical applications.

17.Deep Multi-path Network Integrating Incomplete Biomarker and Chest CT Data for Evaluating Lung Cancer Risk ⬇️

Clinical data elements (CDEs) (e.g., age, smoking history), blood markers and chest computed tomography (CT) structural features have been regarded as effective means for assessing lung cancer risk. These independent variables can provide complementary information and we hypothesize that combining them will improve the prediction accuracy. In practice, not all patients have all these variables available. In this paper, we propose a new network design, termed as multi-path multi-modal missing network (M3Net), to integrate the multi-modal data (i.e., CDEs, biomarker and CT image) considering missing modality with multiple paths neural network. Each path learns discriminative features of one modality, and different modalities are fused in a second stage for an integrated prediction. The network can be trained end-to-end with both medical image features and CDEs/biomarkers, or make a prediction with single modality. We evaluate M3Net with datasets including three sites from the Consortium for Molecular and Cellular Characterization of Screen-Detected Lesions (MCL) project. Our method is cross validated within a cohort of 1291 subjects (383 subjects with complete CDEs/biomarkers and CT images), and externally validated with a cohort of 99 subjects (99 with complete CDEs/biomarkers and CT images). Both cross-validation and external-validation results show that combining multiple modality significantly improves the predicting performance of single modality. The results suggest that integrating subjects with missing either CDEs/biomarker or CT imaging features can contribute to the discriminatory power of our model (p < 0.05, bootstrap two-tailed test). In summary, the proposed M3Net framework provides an effective way to integrate image and non-image data in the context of missing information.

18.Emerging Trends of Multimodal Research in Vision and Language ⬇️

Deep Learning and its applications have cascaded impactful research and development with a diverse range of modalities present in the real-world data. More recently, this has enhanced research interests in the intersection of the Vision and Language arena with its numerous applications and fast-paced growth. In this paper, we present a detailed overview of the latest trends in research pertaining to visual and language modalities. We look at its applications in their task formulations and how to solve various problems related to semantic perception and content generation. We also address task-specific trends, along with their evaluation strategies and upcoming challenges. Moreover, we shed some light on multi-disciplinary patterns and insights that have emerged in the recent past, directing this field towards more modular and transparent intelligent systems. This survey identifies key trends gravitating recent literature in VisLang research and attempts to unearth directions that the field is heading towards.

19.A Backbone Replaceable Fine-tuning Network for Stable Face Alignment ⬇️

Heatmap regression based face alignment algorithms have achieved prominent performance on static images. However, when applying these methods on videos or sequential images, the stability and accuracy are remarkably discounted. The reason lies in temporal informations are not considered, which is mainly reflected in network structure and loss function. This paper presents a novel backbone replaceable fine-tuning framework, which can swiftly convert facial landmark detector designed for single image level into a better performing one that suitable for videos. On this basis, we proposed the Jitter loss, an innovative temporal information based loss function devised to impose strong penalties on prediction landmarks that jitter around the ground truth. Our framework provides capabilities to achieve at least 40% performance improvement on stability evaluation metrices while enhancing accuracy without re-training the entire model versus state-of-the-art methods.

20.Softer Pruning, Incremental Regularization ⬇️

Network pruning is widely used to compress Deep Neural Networks (DNNs). The Soft Filter Pruning (SFP) method zeroizes the pruned filters during training while updating them in the next training epoch. Thus the trained information of the pruned filters is completely dropped. To utilize the trained pruned filters, we proposed a SofteR Filter Pruning (SRFP) method and its variant, Asymptotic SofteR Filter Pruning (ASRFP), simply decaying the pruned weights with a monotonic decreasing parameter. Our methods perform well across various networks, datasets and pruning rates, also transferable to weight pruning. On ILSVRC-2012, ASRFP prunes 40% of the parameters on ResNet-34 with 1.63% top-1 and 0.68% top-5 accuracy improvement. In theory, SRFP and ASRFP are an incremental regularization of the pruned filters. Besides, We note that SRFP and ASRFP pursue better results while slowing down the speed of convergence.

21.Noisy-LSTM: Improving Temporal Awareness for Video Semantic Segmentation ⬇️

Semantic video segmentation is a key challenge for various applications. This paper presents a new model named Noisy-LSTM, which is trainable in an end-to-end manner, with convolutional LSTMs (ConvLSTMs) to leverage the temporal coherency in video frames. We also present a simple yet effective training strategy, which replaces a frame in video sequence with noises. This strategy spoils the temporal coherency in video frames during training and thus makes the temporal links in ConvLSTMs unreliable, which may consequently improve feature extraction from video frames, as well as serve as a regularizer to avoid overfitting, without requiring extra data annotation or computational costs. Experimental results demonstrate that the proposed model can achieve state-of-the-art performances in both the CityScapes and EndoVis2018 datasets.

22.Synthesizing the Unseen for Zero-shot Object Detection ⬇️

The existing zero-shot detection approaches project visual features to the semantic domain for seen objects, hoping to map unseen objects to their corresponding semantics during inference. However, since the unseen objects are never visualized during training, the detection model is skewed towards seen content, thereby labeling unseen as background or a seen class. In this work, we propose to synthesize visual features for unseen classes, so that the model learns both seen and unseen objects in the visual domain. Consequently, the major challenge becomes, how to accurately synthesize unseen objects merely using their class semantics? Towards this ambitious goal, we propose a novel generative model that uses class-semantics to not only generate the features but also to discriminatively separate them. Further, using a unified model, we ensure the synthesized features have high diversity that represents the intra-class differences and variable localization precision in the detected bounding boxes. We test our approach on three object detection benchmarks, PASCAL VOC, MSCOCO, and ILSVRC detection, under both conventional and generalized settings, showing impressive gains over the state-of-the-art methods. Our codes are available at this https URL.

23.Comprehensive evaluation of no-reference image quality assessment algorithms on KADID-10k database ⬇️

The main goal of objective image quality assessment is to devise computational, mathematical models which are able to predict perceptual image quality consistently with subjective evaluations. The evaluation of objective image quality assessment algorithms is based on experiments conducted on publicly available benchmark databases. In this study, our goal is to give a comprehensive evaluation about no-reference image quality assessment algorithms, whose original source codes are available online, using the recently published KADID-10k database which is one of the largest available benchmark databases. Specifically, average PLCC, SROCC, and KROCC are reported which were measured over 100 random train-test splits. Furthermore, the database was divided into a train (appx. 80% of images) and a test set (appx. 20% of images) with respect to the reference images. So no semantic content overlap was between these two sets. Our evaluation results may be helpful to obtain a clear understanding about the status of state-of-the-art no-reference image quality assessment methods.

24.Image Captioning with Visual Object Representations Grounded in the Textual Modality ⬇️

We present our work in progress exploring the possibilities of a shared embedding space between textual and visual modality. Leveraging the textual nature of object detection labels and the hypothetical expressiveness of extracted visual object representations, we propose an approach opposite to the current trend, grounding of the representations in the word embedding space of the captioning system instead of grounding words or sentences in their associated images. Based on the previous work, we apply additional grounding losses to the image captioning training objective aiming to force visual object representations to create more heterogeneous clusters based on their class label and copy a semantic structure of the word embedding space. In addition, we provide an analysis of the learned object vector space projection and its impact on the IC system performance. With only slight change in performance, grounded models reach the stopping criterion during training faster than the unconstrained model, needing about two to three times less training updates. Additionally, an improvement in structural correlation between the word embeddings and both original and projected object vectors suggests that the grounding is actually mutual.

25.SD-DefSLAM: Semi-Direct Monocular SLAM for Deformable and Intracorporeal Scenes ⬇️

Conventional SLAM techniques strongly rely on scene rigidity to solve data association, ignoring dynamic parts of the scene. In this work we present Semi-Direct DefSLAM (SD-DefSLAM), a novel monocular deformable SLAM method able to map highly deforming environments, built on top of DefSLAM. To robustly solve data association in challenging deforming scenes, SD-DefSLAM combines direct and indirect methods: an enhanced illumination-invariant Lucas-Kanade tracker for data association, geometric Bundle Adjustment for pose and deformable map estimation, and bag-of-words based on feature descriptors for camera relocation. Dynamic objects are detected and segmented-out using a CNN trained for the specific application domain. We thoroughly evaluate our system in two public datasets. The mandala dataset is a SLAM benchmark with increasingly aggressive deformations. The Hamlyn dataset contains intracorporeal sequences that pose serious real-life challenges beyond deformation like weak texture, specular reflections, surgical tools and occlusions. Our results show that SD-DefSLAM outperforms DefSLAM in point tracking, reconstruction accuracy and scale drift thanks to the improvement in all the data association steps, being the first system able to robustly perform SLAM inside the human body.

26.A combined full-reference image quality assessment method based on convolutional activation maps ⬇️

The goal of full-reference image quality assessment (FR-IQA) is to predict the quality of an image as perceived by human observers with using its pristine, reference counterpart. In this study, we explore a novel, combined approach which predicts the perceptual quality of a distorted image by compiling a feature vector from convolutional activation maps. More specifically, a reference-distorted image pair is run through a pretrained convolutional neural network and the activation maps are compared with a traditional image similarity metric. Subsequently, the resulted feature vector is mapped onto perceptual quality scores with the help of a trained support vector regressor. A detailed parameter study is also presented in which the design choices of the proposed method is reasoned. Furthermore, we study the relationship between the amount of training images and the prediction performance. Specifically, it is demonstrated that the proposed method can be trained with few amount of data to reach high prediction performance. Our best proposal - ActMapFeat - is compared to the state-of-the-art on six publicly available benchmark IQA databases, such as KADID-10k, TID2013, TID2008, MDID, CSIQ, and VCL-FER. Specifically, our method is able to significantly outperform the state-of-the-art on these benchmark databases. This paper is accompanied by the source code of the proposed method: this https URL.

27.SHREC 2020 track: 6D Object Pose Estimation ⬇️

6D pose estimation is crucial for augmented reality, virtual reality, robotic manipulation and visual navigation. However, the problem is challenging due to the variety of objects in the real world. They have varying 3D shape and their appearances in captured images are affected by sensor noise, changing lighting conditions and occlusions between objects. Different pose estimation methods have different strengths and weaknesses, depending on feature representations and scene contents. At the same time, existing 3D datasets that are used for data-driven methods to estimate 6D poses have limited view angles and low resolution. To address these issues, we organize the Shape Retrieval Challenge benchmark on 6D pose estimation and create a physically accurate simulator that is able to generate photo-realistic color-and-depth image pairs with corresponding ground truth 6D poses. From captured color and depth images, we use this simulator to generate a 3D dataset which has 400 photo-realistic synthesized color-and-depth image pairs with various view angles for training, and another 100 captured and synthetic images for testing. Five research groups register in this track and two of them submitted their results. Data-driven methods are the current trend in 6D object pose estimation and our evaluation results show that approaches which fully exploit the color and geometric features are more robust for 6D pose estimation of reflective and texture-less objects and occlusion. This benchmark and comparative evaluation results have the potential to further enrich and boost the research of 6D object pose estimation and its applications.

28.The efficacy of Neural Planning Metrics: A meta-analysis of PKL on nuScenes ⬇️

A high-performing object detection system plays a crucial role in autonomous driving (AD). The performance, typically evaluated in terms of mean Average Precision, does not take into account orientation and distance of the actors in the scene, which are important for the safe AD. It also ignores environmental context. Recently, Philion et al. proposed a neural planning metric (PKL), based on the KL divergence of a planner's trajectory and the groundtruth route, to accommodate these requirements. In this paper, we use this neural planning metric to score all submissions of the nuScenes detection challenge and analyze the results. We find that while somewhat correlated with mAP, the PKL metric shows different behavior to increased traffic density, ego velocity, road curvature and intersections. Finally, we propose ideas to extend the neural planning metric.

29.SelfVoxeLO: Self-supervised LiDAR Odometry with Voxel-based Deep Neural Networks ⬇️

Recent learning-based LiDAR odometry methods have demonstrated their competitiveness. However, most methods still face two substantial challenges: 1) the 2D projection representation of LiDAR data cannot effectively encode 3D structures from the point clouds; 2) the needs for a large amount of labeled data for training limit the application scope of these methods. In this paper, we propose a self-supervised LiDAR odometry method, dubbed SelfVoxeLO, to tackle these two difficulties. Specifically, we propose a 3D convolution network to process the raw LiDAR data directly, which extracts features that better encode the 3D geometric patterns. To suit our network to self-supervised learning, we design several novel loss functions that utilize the inherent properties of LiDAR point clouds. Moreover, an uncertainty-aware mechanism is incorporated in the loss functions to alleviate the interference of moving objects/noises. We evaluate our method's performances on two large-scale datasets, i.e., KITTI and Apollo-SouthBay. Our method outperforms state-of-the-art unsupervised methods by 27%/32% in terms of translational/rotational errors on the KITTI dataset and also performs well on the Apollo-SouthBay dataset. By including more unlabelled training data, our method can further improve performance comparable to the supervised methods.

30.SMA-STN: Segmented Movement-Attending Spatiotemporal Network forMicro-Expression Recognition ⬇️

Correctly perceiving micro-expression is difficult since micro-expression is an involuntary, repressed, and subtle facial expression, and efficiently revealing the subtle movement changes and capturing the significant segments in a micro-expression sequence is the key to micro-expression recognition (MER). To handle the crucial issue, in this paper, we firstly propose a dynamic segmented sparse imaging module (DSSI) to compute dynamic images as local-global spatiotemporal descriptors under a unique sampling protocol, which reveals the subtle movement changes visually in an efficient way. Secondly, a segmented movement-attending spatiotemporal network (SMA-STN) is proposed to further unveil imperceptible small movement changes, which utilizes a spatiotemporal movement-attending module (STMA) to capture long-distance spatial relation for facial expression and weigh temporal segments. Besides, a deviation enhancement loss (DE-Loss) is embedded in the SMA-STN to enhance the robustness of SMA-STN to subtle movement changes in feature level. Extensive experiments on three widely used benchmarks, i.e., CASME II, SAMM, and SHIC, show that the proposed SMA-STN achieves better MER performance than other state-of-the-art methods, which proves that the proposed method is effective to handle the challenging MER problem.

31.Semantic-Guided Inpainting Network for Complex Urban Scenes Manipulation ⬇️

Manipulating images of complex scenes to reconstruct, insert and/or remove specific object instances is a challenging task. Complex scenes contain multiple semantics and objects, which are frequently cluttered or ambiguous, thus hampering the performance of inpainting models. Conventional techniques often rely on structural information such as object contours in multi-stage approaches that generate unreliable results and boundaries. In this work, we propose a novel deep learning model to alter a complex urban scene by removing a user-specified portion of the image and coherently inserting a new object (e.g. car or pedestrian) in that scene. Inspired by recent works on image inpainting, our proposed method leverages the semantic segmentation to model the content and structure of the image, and learn the best shape and location of the object to insert. To generate reliable results, we design a new decoder block that combines the semantic segmentation and generation task to guide better the generation of new objects and scenes, which have to be semantically consistent with the image. Our experiments, conducted on two large-scale datasets of urban scenes (Cityscapes and Indian Driving), show that our proposed approach successfully address the problem of semantically-guided inpainting of complex urban scene.

32.A Two-stage Unsupervised Approach for Low light Image Enhancement ⬇️

As vision based perception methods are usually built on the normal light assumption, there will be a serious safety issue when deploying them into low light environments. Recently, deep learning based methods have been proposed to enhance low light images by penalizing the pixel-wise loss of low light and normal light images. However, most of them suffer from the following problems: 1) the need of pairs of low light and normal light images for training, 2) the poor performance for dark images, 3) the amplification of noise. To alleviate these problems, in this paper, we propose a two-stage unsupervised method that decomposes the low light image enhancement into a pre-enhancement and a post-refinement problem. In the first stage, we pre-enhance a low light image with a conventional Retinex based method. In the second stage, we use a refinement network learned with adversarial training for further improvement of the image quality. The experimental results show that our method outperforms previous methods on four benchmark datasets. In addition, we show that our method can significantly improve feature points matching and simultaneous localization and mapping in low light conditions.

33.Language and Visual Entity Relationship Graph for Agent Navigation ⬇️

Vision-and-Language Navigation (VLN) requires an agent to navigate in a real-world environment following natural language instructions. From both the textual and visual perspectives, we find that the relationships among the scene, its objects,and directional clues are essential for the agent to interpret complex instructions and correctly perceive the environment. To capture and utilize the relationships, we propose a novel Language and Visual Entity Relationship Graph for modelling the inter-modal relationships between text and vision, and the intra-modal relationships among visual entities. We propose a message passing algorithm for propagating information between language elements and visual entities in the graph, which we then combine to determine the next action to take. Experiments show that by taking advantage of the relationships we are able to improve over state-of-the-art. On the Room-to-Room (R2R) benchmark, our method achieves the new best performance on the test unseen split with success rate weighted by path length (SPL) of 52%. On the Room-for-Room (R4R) dataset, our method significantly improves the previous best from 13% to 34% on the success weighted by normalized dynamic time warping (SDTW). Code is available at: this https URL.

34.Double-Uncertainty Weighted Method for Semi-supervised Learning ⬇️

Though deep learning has achieved advanced performance recently, it remains a challenging task in the field of medical imaging, as obtaining reliable labeled training data is time-consuming and expensive. In this paper, we propose a double-uncertainty weighted method for semi-supervised segmentation based on the teacher-student model. The teacher model provides guidance for the student model by penalizing their inconsistent prediction on both labeled and unlabeled data. We train the teacher model using Bayesian deep learning to obtain double-uncertainty, i.e. segmentation uncertainty and feature uncertainty. It is the first to extend segmentation uncertainty estimation to feature uncertainty, which reveals the capability to capture information among channels. A learnable uncertainty consistency loss is designed for the unsupervised learning process in an interactive manner between prediction and uncertainty. With no ground-truth for supervision, it can still incentivize more accurate teacher's predictions and facilitate the model to reduce uncertain estimations. Furthermore, our proposed double-uncertainty serves as a weight on each inconsistency penalty to balance and harmonize supervised and unsupervised training processes. We validate the proposed feature uncertainty and loss function through qualitative and quantitative analyses. Experimental results show that our method outperforms the state-of-the-art uncertainty-based semi-supervised methods on two public medical datasets.

35.Bi-Real Net V2: Rethinking Non-linearity for 1-bit CNNs and Going Beyond ⬇️

Binary neural networks (BNNs), where both weights and activations are binarized into 1 bit, have been widely studied in recent years due to its great benefit of highly accelerated computation and substantially reduced memory footprint that appeal to the development of resource constrained devices. In contrast to previous methods tending to reduce the quantization error for training BNN structures, we argue that the binarized convolution process owns an increasing linearity towards the target of minimizing such error, which in turn hampers BNN's discriminative ability. In this paper, we re-investigate and tune proper non-linear modules to fix that contradiction, leading to a strong baseline which achieves state-of-the-art performance on the large-scale ImageNet dataset in terms of accuracy and training efficiency. To go further, we find that the proposed BNN model still has much potential to be compressed by making a better use of the efficient binary operations, without losing accuracy. In addition, the limited capacity of the BNN model can also be increased with the help of group execution. Based on these insights, we are able to improve the baseline with an additional 4~5% top-1 accuracy gain even with less computational cost. Our code will be made public at this https URL.

36.Frame Aggregation and Multi-Modal Fusion Framework for Video-Based Person Recognition ⬇️

Video-based person recognition is challenging due to persons being blocked and blurred, and the variation of shooting angle. Previous research always focused on person recognition on still images, ignoring similarity and continuity between video frames. To tackle the challenges above, we propose a novel Frame Aggregation and Multi-Modal Fusion (FAMF) framework for video-based person recognition, which aggregates face features and incorporates them with multi-modal information to identify persons in videos. For frame aggregation, we propose a novel trainable layer based on NetVLAD (named AttentionVLAD), which takes arbitrary number of features as input and computes a fixed-length aggregation feature based on feature quality. We show that introducing an attention mechanism to NetVLAD can effectively decrease the impact of low-quality frames. For the multi-model information of videos, we propose a Multi-Layer Multi-Modal Attention (MLMA) module to learn the correlation of multi-modality by adaptively updating Gram matrix. Experimental results on iQIYI-VID-2019 dataset show that our framework outperforms other state-of-the-art methods.

37.Modality-Pairing Learning for Brain Tumor Segmentation ⬇️

Automatic brain tumor segmentation from multi-modality Magnetic Resonance Images (MRI) using deep learning methods plays an important role in assisting the diagnosis and treatment of brain tumor. However, previous methods mostly ignore the latent relationship among different modalities. In this work, we propose a novel end-to-end Modality-Pairing learning method for brain tumor segmentation. Paralleled branches are designed to exploit different modality features and a series of layer connections are utilized to capture complex relationships and abundant information among modalities. We also use a consistency loss to minimize the prediction variance between two branches. Besides, learning rate warmup strategy is adopted to solve the problem of the training instability and early over-fitting. Lastly, we use average ensemble of multiple models and some post-processing techniques to get final results. Our method is tested on the BraTS 2020 validation dataset, obtaining promising segmentation performance, with average dice scores of $0.908, 0.856, 0.787$ for the whole tumor, tumor core and enhancing tumor, respectively. We won the second place of the BraTS 2020 Challenge for the tumor segmentation on the testing dataset.

38.DeepReflecs: Deep Learning for Automotive Object Classification with Radar Reflections ⬇️

This paper presents an novel object type classification method for automotive applications which uses deep learning with radar reflections. The method provides object class information such as pedestrian, cyclist, car, or non-obstacle. The method is both powerful and efficient, by using a light-weight deep learning approach on reflection level radar data. It fills the gap between low-performant methods of handcrafted features and high-performant methods with convolutional neural networks. The proposed network exploits the specific characteristics of radar reflection data: It handles unordered lists of arbitrary length as input and it combines both extraction of local and global features. In experiments with real data the proposed network outperforms existing methods of handcrafted or learned features. An ablation study analyzes the impact of the proposed global context layer.

39.Extraction of Discrete Spectra Modes from Video Data Using a Deep Convolutional Koopman Network ⬇️

Recent deep learning extensions in Koopman theory have enabled compact, interpretable representations of nonlinear dynamical systems which are amenable to linear analysis. Deep Koopman networks attempt to learn the Koopman eigenfunctions which capture the coordinate transformation to globally linearize system dynamics. These eigenfunctions can be linked to underlying system modes which govern the dynamical behavior of the system. While many related techniques have demonstrated their efficacy on canonical systems and their associated state variables, in this work the system dynamics are observed optically (i.e. in video format). We demonstrate the ability of a deep convolutional Koopman network (CKN) in automatically identifying independent modes for dynamical systems with discrete spectra. Practically, this affords flexibility in system data collection as the data are easily obtainable observable variables. The learned models are able to successfully and robustly identify the underlying modes governing the system, even with a redundantly large embedding space. Modal disaggregation is encouraged using a simple masking procedure. All of the systems analyzed in this work use an identical network architecture.

40.MCGKT-Net: Multi-level Context Gating Knowledge Transfer Network for Single Image Deraining ⬇️

Rain streak removal in a single image is a very challenging task due to its ill-posed nature in essence. Recently, the end-to-end learning techniques with deep convolutional neural networks (DCNN) have made great progress in this task. However, the conventional DCNN-based deraining methods have struggled to exploit deeper and more complex network architectures for pursuing better performance. This study proposes a novel MCGKT-Net for boosting deraining performance, which is a naturally multi-scale learning framework being capable of exploring multi-scale attributes of rain streaks and different semantic structures of the clear images. In order to obtain high representative features inside MCGKT-Net, we explore internal knowledge transfer module using ConvLSTM unit for conducting interaction learning between different layers and investigate external knowledge transfer module for leveraging the knowledge already learned in other task domains. Furthermore, to dynamically select useful features in learning procedure, we propose a multi-scale context gating module in the MCGKT-Net using squeeze-and-excitation block. Experiments on three benchmark datasets: Rain100H, Rain100L, and Rain800, manifest impressive performance compared with state-of-the-art methods.

41.Continual Unsupervised Domain Adaptation with Adversarial Learning ⬇️

Unsupervised Domain Adaptation (UDA) is essential for autonomous driving due to a lack of labeled real-world road images. Most of the existing UDA methods, however, have focused on a single-step domain adaptation (Synthetic-to-Real). These methods overlook a change in environments in the real world as time goes by. Thus, developing a domain adaptation method for sequentially changing target domains without catastrophic forgetting is required for real-world applications. To deal with the problem above, we propose Continual Unsupervised Domain Adaptation with Adversarial learning (CUDA^2) framework, which can generally be applicable to other UDA methods conducting adversarial learning. CUDA^2 framework generates a sub-memory, called Target-specific Memory (TM) for each new target domain guided by Double Hinge Adversarial (DHA) loss. TM prevents catastrophic forgetting by storing target-specific information, and DHA loss induces a synergy between the existing network and the expanded TM. To the best of our knowledge, we consider realistic autonomous driving scenarios (Synthetic-to-Real-to-Real) in UDA research for the first time. The model with our framework outperforms other state-of-the-art models under the same settings. Besides, extensive experiments are conducted as ablation studies for in-depth analysis.

42.Intelligent Reference Curation for Visual Place Recognition via Bayesian Selective Fusion ⬇️

The key challenge of visual place recognition (VPR) lies in recognizing places despite drastic visual appearance changes due to factors such as time of day, season, or weather or lighting conditions. Numerous approaches based on deep-learnt image descriptors, sequence matching, domain translation, and probabilistic localization have had success in addressing this challenge, but most rely on the availability of carefully curated representative reference images of the possible places. In this paper, we propose a novel approach, dubbed Bayesian Selective Fusion, for actively selecting and fusing informative reference images to determine the best place match for a given query image. The selective element of our approach avoids the counterproductive fusion of every reference image and enables the dynamic selection of informative reference images in environments with changing visual conditions (such as indoors with flickering lights, outdoors during sunshowers or over the day-night cycle). The probabilistic element of our approach provides a means of fusing multiple reference images that accounts for their varying uncertainty via a novel training-free likelihood function for VPR. On difficult query images from two benchmark datasets, we demonstrate that our approach matches and exceeds the performance of several alternative fusion approaches along with state-of-the-art techniques that are provided with a priori (unfair) knowledge of the best reference images. Our approach is well suited for long-term robot autonomy where dynamic visual environments are commonplace since it is training-free, descriptor-agnostic, and complements existing techniques such as sequence matching.

43.Self-Supervised Visual Attention Learning for Vehicle Re-Identification ⬇️

Visual attention learning (VAL) aims to produce a confidence map as weights to detect discriminative features in each image for certain task such as vehicle re-identification (ReID) where the same vehicle instance needs to be identified across different cameras. In contrast to the literature, in this paper we propose utilizing self-supervised learning to regularize VAL to improving the performance for vehicle ReID. Mathematically using lifting we can factorize the two functions of VAL and self-supervised regularization through another shared function. We implement such factorization using a deep learning framework consisting of three branches: (1) a global branch as backbone for image feature extraction, (2) an attentional branch for producing attention masks, and (3) a self-supervised branch for regularizing the attention learning. Our network design naturally leads to an end-to-end multi-task joint optimization. We conduct comprehensive experiments on three benchmark datasets for vehicle ReID, i.e., VeRi-776, CityFlow-ReID, and VehicleID. We demonstrate the state-of-the-art (SOTA) performance of our approach with the capability of capturing informative vehicle parts with no corresponding manual labels. We also demonstrate the good generalization of our approach in other ReID tasks such as person ReID and multi-target multi-camera tracking.

44.Unsupervised Domain Adaptation for Spatio-Temporal Action Localization ⬇️

Spatio-temporal action localization is an important problem in computer vision that involves detecting where and when activities occur, and therefore requires modeling of both spatial and temporal features. This problem is typically formulated in the context of supervised learning, where the learned classifiers operate on the premise that both training and test data are sampled from the same underlying distribution. However, this assumption does not hold when there is a significant domain shift, leading to poor generalization performance on the test data. To address this, we focus on the hard and novel task of generalizing training models to test samples without access to any labels from the latter for spatio-temporal action localization by proposing an end-to-end unsupervised domain adaptation algorithm. We extend the state-of-the-art object detection framework to localize and classify actions. In order to minimize the domain shift, three domain adaptation modules at image level (temporal and spatial) and instance level (temporal) are designed and integrated. We design a new experimental setup and evaluate the proposed method and different adaptation modules on the UCF-Sports, UCF-101 and JHMDB benchmark datasets. We show that significant performance gain can be achieved when spatial and temporal features are adapted separately, or jointly for the most effective results.

45.Rotation Invariant Aerial Image Retrieval with Group Convolutional Metric Learning ⬇️

Remote sensing image retrieval (RSIR) is the process of ranking database images depending on the degree of similarity compared to the query image. As the complexity of RSIR increases due to the diversity in shooting range, angle, and location of remote sensors, there is an increasing demand for methods to address these issues and improve retrieval performance. In this work, we introduce a novel method for retrieving aerial images by merging group convolution with attention mechanism and metric learning, resulting in robustness to rotational variations. For refinement and emphasis on important features, we applied channel attention in each group convolution stage. By utilizing the characteristics of group convolution and channel-wise attention, it is possible to acknowledge the equality among rotated but identically located images. The training procedure has two main steps: (i) training the network with Aerial Image Dataset (AID) for classification, (ii) fine-tuning the network with triplet-loss for retrieval with Google Earth South Korea and NWPU-RESISC45 datasets. Results show that the proposed method performance exceeds other state-of-the-art retrieval methods in both rotated and original environments. Furthermore, we utilize class activation maps (CAM) to visualize the distinct difference of main features between our method and baseline, resulting in better adaptability in rotated environments.

46.MaskNet: A Fully-Convolutional Network to Estimate Inlier Points ⬇️

Point clouds have grown in importance in the way computers perceive the world. From LIDAR sensors in autonomous cars and drones to the time of flight and stereo vision systems in our phones, point clouds are everywhere. Despite their ubiquity, point clouds in the real world are often missing points because of sensor limitations or occlusions, or contain extraneous points from sensor noise or artifacts. These problems challenge algorithms that require computing correspondences between a pair of point clouds. Therefore, this paper presents a fully-convolutional neural network that identifies which points in one point cloud are most similar (inliers) to the points in another. We show improvements in learning-based and classical point cloud registration approaches when retrofitted with our network. We demonstrate these improvements on synthetic and real-world datasets. Finally, our network produces impressive results on test datasets that were unseen during training, thus exhibiting generalizability. Code and videos are available at this https URL

47.Gaussian Constrained Attention Network for Scene Text Recognition ⬇️

Scene text recognition has been a hot topic in computer vision. Recent methods adopt the attention mechanism for sequence prediction which achieve convincing results. However, we argue that the existing attention mechanism faces the problem of attention diffusion, in which the model may not focus on a certain character area. In this paper, we propose Gaussian Constrained Attention Network to deal with this problem. It is a 2D attention-based method integrated with a novel Gaussian Constrained Refinement Module, which predicts an additional Gaussian mask to refine the attention weights. Different from adopting an additional supervision on the attention weights simply, our proposed method introduces an explicit refinement. In this way, the attention weights will be more concentrated and the attention-based recognition network achieves better performance. The proposed Gaussian Constrained Refinement Module is flexible and can be applied to existing attention-based methods directly. The experiments on several benchmark datasets demonstrate the effectiveness of our proposed method. Our code has been available at this https URL.

48.Localized Interactive Instance Segmentation ⬇️

In current interactive instance segmentation works, the user is granted a free hand when providing clicks to segment an object; clicks are allowed on background pixels and other object instances far from the target object. This form of interaction is highly inconsistent with the end goal of efficiently isolating objects of interest. In our work, we propose a clicking scheme wherein user interactions are restricted to the proximity of the object. In addition, we propose a novel transformation of the user-provided clicks to generate a weak localization prior on the object which is consistent with image structures such as edges, textures etc. We demonstrate the effectiveness of our proposed clicking scheme and localization strategy through detailed experimentation in which we raise state-of-the-art on several standard interactive segmentation benchmarks.

49.Image GANs meet Differentiable Rendering for Inverse Graphics and Interpretable 3D Neural Rendering ⬇️

Differentiable rendering has paved the way to training neural networks to perform "inverse graphics" tasks such as predicting 3D geometry from monocular photographs. To train high performing models, most of the current approaches rely on multi-view imagery which are not readily available in practice. Recent Generative Adversarial Networks (GANs) that synthesize images, in contrast, seem to acquire 3D knowledge implicitly during training: object viewpoints can be manipulated by simply manipulating the latent codes. However, these latent codes often lack further physical interpretation and thus GANs cannot easily be inverted to perform explicit 3D reasoning. In this paper, we aim to extract and disentangle 3D knowledge learned by generative models by utilizing differentiable renderers. Key to our approach is to exploit GANs as a multi-view data generator to train an inverse graphics network using an off-the-shelf differentiable renderer, and the trained inverse graphics network as a teacher to disentangle the GAN's latent code into interpretable 3D properties. The entire architecture is trained iteratively using cycle consistency losses. We show that our approach significantly outperforms state-of-the-art inverse graphics networks trained on existing datasets, both quantitatively and via user studies. We further showcase the disentangled GAN as a controllable 3D "neural renderer", complementing traditional graphics renderers.

50.Movement-induced Priors for Deep Stereo ⬇️

We propose a method for fusing stereo disparity estimation with movement-induced prior information. Instead of independent inference frame-by-frame, we formulate the problem as a non-parametric learning task in terms of a temporal Gaussian process prior with a movement-driven kernel for inter-frame reasoning. We present a hierarchy of three Gaussian process kernels depending on the availability of motion information, where our main focus is on a new gyroscope-driven kernel for handheld devices with low-quality MEMS sensors, thus also relaxing the requirement of having full 6D camera poses available. We show how our method can be combined with two state-of-the-art deep stereo methods. The method either work in a plug-and-play fashion with pre-trained deep stereo networks, or further improved by jointly training the kernels together with encoder-decoder architectures, leading to consistent improvement.

51.Variational Capsule Encoder ⬇️

We propose a novel capsule network based variational encoder architecture, called Bayesian capsules (B-Caps), to modulate the mean and standard deviation of the sampling distribution in the latent space. We hypothesized that this approach can learn a better representation of features in the latent space than traditional approaches. Our hypothesis was tested by using the learned latent variables for image reconstruction task, where for MNIST and Fashion-MNIST datasets, different classes were separated successfully in the latent space using our proposed model. Our experimental results have shown improved reconstruction and classification performances for both datasets adding credence to our hypothesis. We also showed that by increasing the latent space dimension, the proposed B-Caps was able to learn a better representation when compared to the traditional variational auto-encoders (VAE). Hence our results indicate the strength of capsule networks in representation learning which has never been examined under the VAE settings before.

52.View-Invariant Gait Recognition with Attentive Recurrent Learning of Partial Representations ⬇️

Gait recognition refers to the identification of individuals based on features acquired from their body movement during walking. Despite the recent advances in gait recognition with deep learning, variations in data acquisition and appearance, namely camera angles, subject pose, occlusions, and clothing, are challenging factors that need to be considered for achieving accurate gait recognition systems. In this paper, we propose a network that first learns to extract gait convolutional energy maps (GCEM) from frame-level convolutional features. It then adopts a bidirectional recurrent neural network to learn from split bins of the GCEM, thus exploiting the relations between learned partial spatiotemporal representations. We then use an attention mechanism to selectively focus on important recurrently learned partial representations as identity information in different scenarios may lie in different GCEM bins. Our proposed model has been extensively tested on two large-scale CASIA-B and OU-MVLP gait datasets using four different test protocols and has been compared to a number of state-of-the-art and baseline solutions. Additionally, a comprehensive experiment has been performed to study the robustness of our model in the presence of six different synthesized occlusions. The experimental results show the superiority of our proposed method, outperforming the state-of-the-art, especially in scenarios where different clothing and carrying conditions are encountered. The results also revealed that our model is more robust against different occlusions as compared to the state-of-the-art methods.

53.Gait Recognition using Multi-Scale Partial Representation Transformation with Capsules ⬇️

Gait recognition, referring to the identification of individuals based on the manner in which they walk, can be very challenging due to the variations in the viewpoint of the camera and the appearance of individuals. Current methods for gait recognition have been dominated by deep learning models, notably those based on partial feature representations. In this context, we propose a novel deep network, learning to transfer multi-scale partial gait representations using capsules to obtain more discriminative gait features. Our network first obtains multi-scale partial representations using a state-of-the-art deep partial feature extractor. It then recurrently learns the correlations and co-occurrences of the patterns among the partial features in forward and backward directions using Bi-directional Gated Recurrent Units (BGRU). Finally, a capsule network is adopted to learn deeper part-whole relationships and assigns more weights to the more relevant features while ignoring the spurious dimensions. That way, we obtain final features that are more robust to both viewing and appearance changes. The performance of our method has been extensively tested on two gait recognition datasets, CASIA-B and OU-MVLP, using four challenging test protocols. The results of our method have been compared to the state-of-the-art gait recognition solutions, showing the superiority of our model, notably when facing challenging viewing and carrying conditions.

54.Graphite: GRAPH-Induced feaTure Extraction for Point Cloud Registration ⬇️

3D Point clouds are a rich source of information that enjoy growing popularity in the vision community. However, due to the sparsity of their representation, learning models based on large point clouds is still a challenge. In this work, we introduce Graphite, a GRAPH-Induced feaTure Extraction pipeline, a simple yet powerful feature transform and keypoint detector. Graphite enables intensive down-sampling of point clouds with keypoint detection accompanied by a descriptor. We construct a generic graph-based learning scheme to describe point cloud regions and extract salient points. To this end, we take advantage of 6D pose information and metric learning to learn robust descriptions and keypoints across different scans. We Reformulate the 3D keypoint pipeline with graph neural networks which allow efficient processing of the point set while boosting its descriptive power which ultimately results in more accurate 3D registrations. We demonstrate our lightweight descriptor on common 3D descriptor matching and point cloud registration benchmarks and achieve comparable results with the state of the art. Describing 100 patches of a point cloud and detecting their keypoints takes only ~0.018 seconds with our proposed network.

55.RADIATE: A Radar Dataset for Automotive Perception ⬇️

Datasets for autonomous cars are essential for the development and benchmarking of perception systems. However, most existing datasets are captured with camera and LiDAR sensors in good weather conditions. In this paper, we present the RAdar Dataset In Adverse weaThEr (RADIATE), aiming to facilitate research on object detection, tracking and scene understanding using radar sensing for safe autonomous driving. RADIATE includes 3 hours of annotated radar images with more than 200K labelled road actors in total, on average about 4.6 instances per radar image. It covers 8 different categories of actors in a variety of weather conditions (e.g., sun, night, rain, fog and snow) and driving scenarios (e.g., parked, urban, motorway and suburban), representing different levels of challenge. To the best of our knowledge, this is the first public radar dataset which provides high-resolution radar images on public roads with a large amount of road actors labelled. The data collected in adverse weather, e.g., fog and snowfall, is unique. Some baseline results of radar based object detection and recognition are given to show that the use of radar data is promising for automotive applications in bad weather, where vision and LiDAR can fail. RADIATE also has stereo images, 32-channel LiDAR and GPS data, directed at other applications such as sensor fusion, localisation and mapping. The public dataset can be accessed at this http URL.

56.Multimodal semantic forecasting based on conditional generation of future features ⬇️

This paper considers semantic forecasting in road-driving scenes. Most existing approaches address this problem as deterministic regression of future features or future predictions given observed frames. However, such approaches ignore the fact that future can not always be guessed with certainty. For example, when a car is about to turn around a corner, the road which is currently occluded by buildings may turn out to be either free to drive, or occupied by people, other vehicles or roadworks. When a deterministic model confronts such situation, its best guess is to forecast the most likely outcome. However, this is not acceptable since it defeats the purpose of forecasting to improve security. It also throws away valuable training data, since a deterministic model is unable to learn any deviation from the norm. We address this problem by providing more freedom to the model through allowing it to forecast different futures. We propose to formulate multimodal forecasting as sampling of a multimodal generative model conditioned on the observed frames. Experiments on the Cityscapes dataset reveal that our multimodal model outperforms its deterministic counterpart in short-term forecasting while performing slightly worse in the mid-term case.

57.Exploiting Context for Robustness to Label Noise in Active Learning ⬇️

Several works in computer vision have demonstrated the effectiveness of active learning for adapting the recognition model when new unlabeled data becomes available. Most of these works consider that labels obtained from the annotator are correct. However, in a practical scenario, as the quality of the labels depends on the annotator, some of the labels might be wrong, which results in degraded recognition performance. In this paper, we address the problems of i) how a system can identify which of the queried labels are wrong and ii) how a multi-class active learning system can be adapted to minimize the negative impact of label noise. Towards solving the problems, we propose a noisy label filtering based learning approach where the inter-relationship (context) that is quite common in natural data is utilized to detect the wrong labels. We construct a graphical representation of the unlabeled data to encode these relationships and obtain new beliefs on the graph when noisy labels are available. Comparing the new beliefs with the prior relational information, we generate a dissimilarity score to detect the incorrect labels and update the recognition model with correct labels which result in better recognition performance. This is demonstrated in three different applications: scene classification, activity classification, and document classification.

58.Deep Structured Prediction for Facial Landmark Detection ⬇️

Existing deep learning based facial landmark detection methods have achieved excellent performance. These methods, however, do not explicitly embed the structural dependencies among landmark points. They hence cannot preserve the geometric relationships between landmark points or generalize well to challenging conditions or unseen data. This paper proposes a method for deep structured facial landmark detection based on combining a deep Convolutional Network with a Conditional Random Field. We demonstrate its superior performance to existing state-of-the-art techniques in facial landmark detection, especially a better generalization ability on challenging datasets that include large pose and occlusion.

59.Covapixels ⬇️

We propose and discuss the summarization of superpixel-type image tiles/patches using mean and covariance information. We refer to the resulting objects as covapixels.

60.FGAGT: Flow-Guided Adaptive Graph Tracking ⬇️

Multi-object tracking (MOT) has always been a very important research direction in computer vision and has great applications in autonomous driving, video object behavior prediction, traffic management, and accident prevention. Recently, some methods have made great progress on MOT, such as CenterTrack, which predicts the trajectory position based on optical flow then tracks it, and FairMOT, which uses higher resolution feature maps to extract Re-id features. In this article, we propose the FGAGT tracker. Different from FairMOT, we use Pyramid Lucas Kanade optical flow method to predict the position of the historical objects in the current frame, and use ROI Pooling\cite{He2015} and fully connected layers to extract the historical objects' appearance feature vectors on the feature maps of the current frame. Next, input them and new objects' feature vectors into the adaptive graph neural network to update the feature vectors. The adaptive graph network can update the feature vectors of the objects by combining historical global position and appearance information. Because the historical information is preserved, it can also re-identify the occluded objects. In the training phase, we propose the Balanced MSE LOSS to balance the sample distribution. In the Inference phase, we use the Hungarian algorithm for data association. Our method reaches the level of state-of-the-art, where the MOTA index exceeds FairMOT by 2.5 points, and CenterTrack by 8.4 points on the MOT17 dataset, exceeds FairMOT by 7.2 points on the MOT16 dataset.

61.Image-based Automated Species Identification: Can Virtual Data Augmentation Overcome Problems of Insufficient Sampling? ⬇️

Automated species identification and delimitation is challenging, particularly in rare and thus often scarcely sampled species, which do not allow sufficient discrimination of infraspecific versus interspecific variation. Typical problems arising from either low or exaggerated interspecific morphological differentiation are best met by automated methods of machine learning that learn efficient and effective species identification from training samples. However, limited infraspecific sampling remains a key challenge also in machine learning. 1In this study, we assessed whether a two-level data augmentation approach may help to overcome the problem of scarce training data in automated visual species identification. The first level of visual data augmentation applies classic approaches of data augmentation and generation of faked images using a GAN approach. Descriptive feature vectors are derived from bottleneck features of a VGG-16 convolutional neural network (CNN) that are then stepwise reduced in dimensionality using Global Average Pooling and PCA to prevent overfitting. The second level of data augmentation employs synthetic additional sampling in feature space by an oversampling algorithm in vector space (SMOTE). Applied on two challenging datasets of scarab beetles (Coleoptera), our augmentation approach outperformed a non-augmented deep learning baseline approach as well as a traditional 2D morphometric approach (Procrustes analysis).

62.Multiple Future Prediction Leveraging Synthetic Trajectories ⬇️

Trajectory prediction is an important task, especially in autonomous driving. The ability to forecast the position of other moving agents can yield to an effective planning, ensuring safety for the autonomous vehicle as well for the observed entities. In this work we propose a data driven approach based on Markov Chains to generate synthetic trajectories, which are useful for training a multiple future trajectory predictor. The advantages are twofold: on the one hand synthetic samples can be used to augment existing datasets and train more effective predictors; on the other hand, it allows to generate samples with multiple ground truths, corresponding to diverse equally likely outcomes of the observed trajectory. We define a trajectory prediction model and a loss that explicitly address the multimodality of the problem and we show that combining synthetic and real data leads to prediction improvements, obtaining state of the art results.

63.Temporal Binary Representation for Event-Based Action Recognition ⬇️

In this paper we present an event aggregation strategy to convert the output of an event camera into frames processable by traditional Computer Vision algorithms. The proposed method first generates sequences of intermediate binary representations, which are then losslessly transformed into a compact format by simply applying a binary-to-decimal conversion. This strategy allows us to encode temporal information directly into pixel values, which are then interpreted by deep learning models. We apply our strategy, called Temporal Binary Representation, to the task of Gesture Recognition, obtaining state of the art results on the popular DVS128 Gesture Dataset. To underline the effectiveness of the proposed method compared to existing ones, we also collect an extension of the dataset under more challenging conditions on which to perform experiments.

64.Distortion-aware Monocular Depth Estimation for Omnidirectional Images ⬇️

A main challenge for tasks on panorama lies in the distortion of objects among images. In this work, we propose a Distortion-Aware Monocular Omnidirectional (DAMO) dense depth estimation network to address this challenge on indoor panoramas with two steps. First, we introduce a distortion-aware module to extract calibrated semantic features from omnidirectional images. Specifically, we exploit deformable convolution to adjust its sampling grids to geometric variations of distorted objects on panoramas and then utilize a strip pooling module to sample against horizontal distortion introduced by inverse gnomonic projection. Second, we further introduce a plug-and-play spherical-aware weight matrix for our objective function to handle the uneven distribution of areas projected from a sphere. Experiments on the 360D dataset show that the proposed method can effectively extract semantic features from distorted panoramas and alleviate the supervision bias caused by distortion. It achieves state-of-the-art performance on the 360D dataset with high efficiency.

65.Boosting High-Level Vision with Joint Compression Artifacts Reduction and Super-Resolution ⬇️

Due to the limits of bandwidth and storage space, digital images are usually down-scaled and compressed when transmitted over networks, resulting in loss of details and jarring artifacts that can lower the performance of high-level visual tasks. In this paper, we aim to generate an artifact-free high-resolution image from a low-resolution one compressed with an arbitrary quality factor by exploring joint compression artifacts reduction (CAR) and super-resolution (SR) tasks. First, we propose a context-aware joint CAR and SR neural network (CAJNN) that integrates both local and non-local features to solve CAR and SR in one-stage. Finally, a deep reconstruction network is adopted to predict high quality and high-resolution images. Evaluation on CAR and SR benchmark datasets shows that our CAJNN model outperforms previous methods and also takes 26.2% shorter runtime. Based on this model, we explore addressing two critical challenges in high-level computer vision: optical character recognition of low-resolution texts, and extremely tiny face detection. We demonstrate that CAJNN can serve as an effective image preprocessing method and improve the accuracy for real-scene text recognition (from 85.30% to 85.75%) and the average precision for tiny face detection (from 0.317 to 0.611).

66.Sensitivity and Specificity Evaluation of Deep Learning Models for Detection of Pneumoperitoneum on Chest Radiographs ⬇️

Background: Deep learning has great potential to assist with detecting and triaging critical findings such as pneumoperitoneum on medical images. To be clinically useful, the performance of this technology still needs to be validated for generalizability across different types of imaging systems. Materials and Methods: This retrospective study included 1,287 chest X-ray images of patients who underwent initial chest radiography at 13 different hospitals between 2011 and 2019. The chest X-ray images were labelled independently by four radiologist experts as positive or negative for pneumoperitoneum. State-of-the-art deep learning models (ResNet101, InceptionV3, DenseNet161, and ResNeXt101) were trained on a subset of this dataset, and the automated classification performance was evaluated on the rest of the dataset by measuring the AUC, sensitivity, and specificity for each model. Furthermore, the generalizability of these deep learning models was assessed by stratifying the test dataset according to the type of the utilized imaging systems. Results: All deep learning models performed well for identifying radiographs with pneumoperitoneum, while DenseNet161 achieved the highest AUC of 95.7%, Specificity of 89.9%, and Sensitivity of 91.6%. DenseNet161 model was able to accurately classify radiographs from different imaging systems (Accuracy: 90.8%), while it was trained on images captured from a specific imaging system from a single institution. This result suggests the generalizability of our model for learning salient features in chest X-ray images to detect pneumoperitoneum, independent of the imaging system.

67.Finding Physical Adversarial Examples for Autonomous Driving with Fast and Differentiable Image Compositing ⬇️

There is considerable evidence that deep neural networks are vulnerable to adversarial perturbations applied directly to their digital inputs. However, it remains an open question whether this translates to vulnerabilities in real-world systems. Specifically, in the context of image inputs to autonomous driving systems, an attack can be achieved only by modifying the physical environment, so as to ensure that the resulting stream of video inputs to the car's controller leads to incorrect driving decisions. Inducing this effect on the video inputs indirectly through the environment requires accounting for system dynamics and tracking viewpoint changes. We propose a scalable and efficient approach for finding adversarial physical modifications, using a differentiable approximation for the mapping from environmental modifications-namely, rectangles drawn on the road-to the corresponding video inputs to the controller network. Given the color, location, position, and orientation parameters of the rectangles, our mapping composites them onto pre-recorded video streams of the original environment. Our mapping accounts for geometric and color variations, is differentiable with respect to rectangle parameters, and uses multiple original video streams obtained by varying the driving trajectory. When combined with a neural network-based controller, our approach allows the design of adversarial modifications through end-to-end gradient-based optimization. We evaluate our approach using the Carla autonomous driving simulator, and show that it is significantly more scalable and far more effective at generating attacks than a prior black-box approach based on Bayesian Optimization.

68.A Grid-based Representation for Human Action Recognition ⬇️

Human action recognition (HAR) in videos is a fundamental research topic in computer vision. It consists mainly in understanding actions performed by humans based on a sequence of visual observations. In recent years, HAR have witnessed significant progress, especially with the emergence of deep learning models. However, most of existing approaches for action recognition rely on information that is not always relevant for the task, and are limited in the way they fuse temporal information. In this paper, we propose a novel method for human action recognition that encodes efficiently the most discriminative appearance information of an action with explicit attention on representative pose features, into a new compact grid representation. Our GRAR (Grid-based Representation for Action Recognition) method is tested on several benchmark datasets that demonstrate that our model can accurately recognize human actions, despite intra-class appearance variations and occlusion challenges.

69.Efficient and Compact Convolutional Neural Network Architectures for Non-temporal Real-time Fire Detection ⬇️

Automatic visual fire detection is used to complement traditional fire detection sensor systems (smoke/heat). In this work, we investigate different Convolutional Neural Network (CNN) architectures and their variants for the non-temporal real-time bounds detection of fire pixel regions in video (or still) imagery. Two reduced complexity compact CNN architectures (NasNet-A-OnFire and ShuffleNetV2-OnFire) are proposed through experimental analysis to optimise the computational efficiency for this task. The results improve upon the current state-of-the-art solution for fire detection, achieving an accuracy of 95% for full-frame binary classification and 97% for superpixel localisation. We notably achieve a classification speed up by a factor of 2.3x for binary classification and 1.3x for superpixel localisation, with runtime of 40 fps and 18 fps respectively, outperforming prior work in the field presenting an efficient, robust and real-time solution for fire region detection. Subsequent implementation on low-powered devices (Nvidia Xavier-NX, achieving 49 fps for full-frame classification via ShuffleNetV2-OnFire) demonstrates our architectures are suitable for various real-world deployment applications.

70.Directed Variational Cross-encoder Network for Few-shot Multi-image Co-segmentation ⬇️

In this paper, we propose a novel framework for multi-image co-segmentation using class agnostic meta-learning strategy by generalizing to new classes given only a small number of training samples for each new class. We have developed a novel encoder-decoder network termed as DVICE (Directed Variational Inference Cross Encoder), which learns a continuous embedding space to ensure better similarity learning. We employ a combination of the proposed DVICE network and a novel few-shot learning approach to tackle the small sample size problem encountered in co-segmentation with small datasets like iCoseg and MSRC. Furthermore, the proposed framework does not use any semantic class labels and is entirely class agnostic. Through exhaustive experimentation over multiple datasets using only a small volume of training data, we have demonstrated that our approach outperforms all existing state-of-the-art techniques.

71.Discovering Pattern Structure Using Differentiable Compositing ⬇️

Patterns, which are collections of elements arranged in regular or near-regular arrangements, are an important graphic art form and widely used due to their elegant simplicity and aesthetic appeal. When a pattern is encoded as a flat image without the underlying structure, manually editing the pattern is tedious and challenging as one has to both preserve the individual element shapes and their original relative arrangements. State-of-the-art deep learning frameworks that operate at the pixel level are unsuitable for manipulating such patterns. Specifically, these methods can easily disturb the shapes of the individual elements or their arrangement, and thus fail to preserve the latent structures of the input patterns. We present a novel differentiable compositing operator using pattern elements and use it to discover structures, in the form of a layered representation of graphical objects, directly from raw pattern images. This operator allows us to adapt current deep learning based image methods to effectively handle patterns. We evaluate our method on a range of patterns and demonstrate superiority in the context of pattern manipulations when compared against state-of-the-art

72.The NVIDIA PilotNet Experiments ⬇️

Four years ago, an experimental system known as PilotNet became the first NVIDIA system to steer an autonomous car along a roadway. This system represents a departure from the classical approach for self-driving in which the process is manually decomposed into a series of modules, each performing a different task. In PilotNet, on the other hand, a single deep neural network (DNN) takes pixels as input and produces a desired vehicle trajectory as output; there are no distinct internal modules connected by human-designed interfaces. We believe that handcrafted interfaces ultimately limit performance by restricting information flow through the system and that a learned approach, in combination with other artificial intelligence systems that add redundancy, will lead to better overall performing systems. We continue to conduct research toward that goal.
This document describes the PilotNet lane-keeping effort, carried out over the past five years by our NVIDIA PilotNet group in Holmdel, New Jersey. Here we present a snapshot of system status in mid-2020 and highlight some of the work done by the PilotNet group.

73.DE-GAN: A Conditional Generative Adversarial Network for Document Enhancement ⬇️

Documents often exhibit various forms of degradation, which make it hard to be read and substantially deteriorate the performance of an OCR system. In this paper, we propose an effective end-to-end framework named Document Enhancement Generative Adversarial Networks (DE-GAN) that uses the conditional GANs (cGANs) to restore severely degraded document images. To the best of our knowledge, this practice has not been studied within the context of generative adversarial deep networks. We demonstrate that, in different tasks (document clean up, binarization, deblurring and watermark removal), DE-GAN can produce an enhanced version of the degraded document with a high quality. In addition, our approach provides consistent improvements compared to state-of-the-art methods over the widely used DIBCO 2013, DIBCO 2017 and H-DIBCO 2018 datasets, proving its ability to restore a degraded document image to its ideal condition. The obtained results on a wide variety of degradation reveal the flexibility of the proposed model to be exploited in other document enhancement problems.

74.Gradient Aware Cascade Network for Multi-Focus Image Fusion ⬇️

The general aim of multi-focus image fusion is to gather focused regions of different images to generate a unique all-in-focus fused image. Deep learning based methods become the mainstream of image fusion by virtue of its powerful feature representation ability. However, most of the existing deep learning structures failed to balance fusion quality and end-to-end implementation convenience. End-to-end decoder design often leads to poor performance because of its non-linear mapping mechanism. On the other hand, generating an intermediate decision map achieves better quality for the fused image, but relies on the rectification with empirical post-processing parameter choices. In this work, to handle the requirements of both output image quality and comprehensive simplicity of structure implementation, we propose a cascade network to simultaneously generate decision map and fused result with an end-to-end training procedure. It avoids the dependence on empirical post-processing methods in the inference stage. To improve the fusion quality, we introduce a gradient aware loss function to preserve gradient information in output fused image. In addition, we design a decision calibration strategy to decrease the time consumption in the application of multiple image fusion. Extensive experiments are conducted to compare with 16 different state-of-the-art multi-focus image fusion structures with 6 assessment metrics. The results prove that our designed structure can generally ameliorate the output fused image quality, while implementation efficiency increases over 30% for multiple image fusion.

75.Self-Selective Context for Interaction Recognition ⬇️

Human-object interaction recognition aims for identifying the relationship between a human subject and an object. Researchers incorporate global scene context into the early layers of deep Convolutional Neural Networks as a solution. They report a significant increase in the performance since generally interactions are correlated with the scene (\ie riding bicycle on the city street). However, this approach leads to the following problems. It increases the network size in the early layers, therefore not efficient. It leads to noisy filter responses when the scene is irrelevant, therefore not accurate. It only leverages scene context whereas human-object interactions offer a multitude of contexts, therefore incomplete. To circumvent these issues, in this work, we propose Self-Selective Context (SSC). SSC operates on the joint appearance of human-objects and context to bring the most discriminative context(s) into play for recognition. We devise novel contextual features that model the locality of human-object interactions and show that SSC can seamlessly integrate with the State-of-the-art interaction recognition models. Our experiments show that SSC leads to an important increase in interaction recognition performance, while using much fewer parameters.

76.Picture-to-Amount (PITA): Predicting Relative Ingredient Amounts from Food Images ⬇️

Increased awareness of the impact of food consumption on health and lifestyle today has given rise to novel data-driven food analysis systems. Although these systems may recognize the ingredients, a detailed analysis of their amounts in the meal, which is paramount for estimating the correct nutrition, is usually ignored. In this paper, we study the novel and challenging problem of predicting the relative amount of each ingredient from a food image. We propose PITA, the Picture-to-Amount deep learning architecture to solve the problem. More specifically, we predict the ingredient amounts using a domain-driven Wasserstein loss from image-to-recipe cross-modal embeddings learned to align the two views of food data. Experiments on a dataset of recipes collected from the Internet show the model generates promising results and improves the baselines on this challenging task. A demo of our system and our data is availableat: this http URL.

77.Robust Face Alignment by Multi-order High-precision Hourglass Network ⬇️

Heatmap regression (HR) has become one of the mainstream approaches for face alignment and has obtained promising results under constrained environments. However, when a face image suffers from large pose variations, heavy occlusions and complicated illuminations, the performances of HR methods degrade greatly due to the low resolutions of the generated landmark heatmaps and the exclusion of important high-order information that can be used to learn more discriminative features. To address the alignment problem for faces with extremely large poses and heavy occlusions, this paper proposes a heatmap subpixel regression (HSR) method and a multi-order cross geometry-aware (MCG) model, which are seamlessly integrated into a novel multi-order high-precision hourglass network (MHHN). The HSR method is proposed to achieve high-precision landmark detection by a well-designed subpixel detection loss (SDL) and subpixel detection technology (SDT). At the same time, the MCG model is able to use the proposed multi-order cross information to learn more discriminative representations for enhancing facial geometric constraints and context information. To the best of our knowledge, this is the first study to explore heatmap subpixel regression for robust and high-precision face alignment. The experimental results from challenging benchmark datasets demonstrate that our approach outperforms state-of-the-art methods in the literature.

78.PolarDet: A Fast, More Precise Detector for Rotated Target in Aerial Images ⬇️

Fast and precise object detection for high-resolution aerial images has been a challenging task over the years. Due to the sharp variations on object scale, rotation, and aspect ratio, most existing methods are inefficient and imprecise. In this paper, we represent the oriented objects by polar method in polar coordinate and propose PolarDet, a fast and accurate one-stage object detector based on that representation. Our detector introduces a sub-pixel center semantic structure to further improve classifying veracity. PolarDet achieves nearly all SOTA performance in aerial object detection tasks with faster inference speed. In detail, our approach obtains the SOTA results on DOTA, UCAS-AOD, HRSC with 76.64% mAP, 97.01% mAP, and 90.46% mAP respectively. Most noticeably, our PolarDet gets the best performance and reaches the fastest speed(32fps) at the UCAS-AOD dataset.

79.A Self-supervised Cascaded Refinement Network for Point Cloud Completion ⬇️

Point clouds are often sparse and incomplete, which imposes difficulties for real-world applications, such as 3D object classification, detection and segmentation. Existing shape completion methods tend to generate coarse shapes of objects without fine-grained details. Moreover, current approaches require fully-complete ground truth, which are difficult to obtain in real-world applications. In view of these, we propose a self-supervised object completion method, which optimizes the training procedure solely on the partial input without utilizing the fully-complete ground truth. In order to generate high-quality objects with detailed geometric structures, we propose a cascaded refinement network (CRN) with a coarse-to-fine strategy to synthesize the complete objects. Considering the local details of partial input together with the adversarial training, we are able to learn the complicated distributions of point clouds and generate the object details as realistic as possible. We verify our self-supervised method on both unsupervised and supervised experimental settings and show superior performances. Quantitative and qualitative experiments on different datasets demonstrate that our method achieves more realistic outputs compared to existing state-of-the-art approaches on the 3D point cloud completion task.

80.CQ-VAE: Coordinate Quantized VAE for Uncertainty Estimation with Application to Disk Shape Analysis from Lumbar Spine MRI Images ⬇️

Ambiguity is inevitable in medical images, which often results in different image interpretations (e.g. object boundaries or segmentation maps) from different human experts. Thus, a model that learns the ambiguity and outputs a probability distribution of the target, would be valuable for medical applications to assess the uncertainty of diagnosis. In this paper, we propose a powerful generative model to learn a representation of ambiguity and to generate probabilistic outputs. Our model, named Coordinate Quantization Variational Autoencoder (CQ-VAE) employs a discrete latent space with an internal discrete probability distribution by quantizing the coordinates of a continuous latent space. As a result, the output distribution from CQ-VAE is discrete. During training, Gumbel-Softmax sampling is used to enable backpropagation through the discrete latent space. A matching algorithm is used to establish the correspondence between model-generated samples and "ground-truth" samples, which makes a trade-off between the ability to generate new samples and the ability to represent training samples. Besides these probabilistic components to generate possible outputs, our model has a deterministic path to output the best estimation. We demonstrated our method on a lumbar disk image dataset, and the results show that our CQ-VAE can learn lumbar disk shape variation and uncertainty.

81.Answer-checking in Context: A Multi-modal FullyAttention Network for Visual Question Answering ⬇️

Visual Question Answering (VQA) is challenging due to the complex cross-modal relations. It has received extensive attention from the research community. From the human perspective, to answer a visual question, one needs to read the question and then refer to the image to generate an answer. This answer will then be checked against the question and image again for the final confirmation. In this paper, we mimic this process and propose a fully attention based VQA architecture. Moreover, an answer-checking module is proposed to perform a unified attention on the jointly answer, question and image representation to update the answer. This mimics the human answer checking process to consider the answer in the context. With answer-checking modules and transferred BERT layers, our model achieves the state-of-the-art accuracy 71.57% using fewer parameters on VQA-v2.0 test-standard split.

82.DEAL: Difficulty-aware Active Learning for Semantic Segmentation ⬇️

Active learning aims to address the paucity of labeled data by finding the most informative samples. However, when applying to semantic segmentation, existing methods ignore the segmentation difficulty of different semantic areas, which leads to poor performance on those hard semantic areas such as tiny or slender objects. To deal with this problem, we propose a semantic Difficulty-awarE Active Learning (DEAL) network composed of two branches: the common segmentation branch and the semantic difficulty branch. For the latter branch, with the supervision of segmentation error between the segmentation result and GT, a pixel-wise probability attention module is introduced to learn the semantic difficulty scores for different semantic areas. Finally, two acquisition functions are devised to select the most valuable samples with semantic difficulty. Competitive results on semantic segmentation benchmarks demonstrate that DEAL achieves state-of-the-art active learning performance and improves the performance of the hard semantic areas in particular.

83.Automatic Tree Ring Detection using Jacobi Sets ⬇️

Tree ring widths are an important source of climatic and historical data, but measuring these widths typically requires extensive manual work. Computer vision techniques provide promising directions towards the automation of tree ring detection, but most automated methods still require a substantial amount of user interaction to obtain high accuracy. We perform analysis on 3D X-ray CT images of a cross-section of a tree trunk, known as a tree disk. We present novel automated methods for locating the pith (center) of a tree disk, and ring boundaries. Our methods use a combination of standard image processing techniques and tools from topological data analysis. We evaluate the efficacy of our method for two different CT scans by comparing its results to manually located rings and centers and show that it is better than current automatic methods in terms of correctly counting each ring and its location. Our methods have several parameters, which we optimize experimentally by minimizing edit distances to the manually obtained locations.

84.MeshMVS: Multi-View Stereo Guided Mesh Reconstruction ⬇️

Deep learning based 3D shape generation methods generally utilize latent features extracted from color images to encode the objects' semantics and guide the shape generation process. These color image semantics only implicitly encode 3D information, potentially limiting the accuracy of the generated shapes. In this paper we propose a multi-view mesh generation method which incorporates geometry information in the color images explicitly by using the features from intermediate 2.5D depth representations of the input images and regularizing the 3D shapes against these depth images. Our system first predicts a coarse 3D volume from the color images by probabilistically merging voxel occupancy grids from individual views. Depth images corresponding to the multi-view color images are predicted which along with the rendered depth images of the coarse shape are used as a contrastive input whose features guide the refinement of the coarse shape through a series of graph convolution networks. Attention-based multi-view feature pooling is proposed to fuse the contrastive depth features from different viewpoints which are fed to the graph convolution networks.
We validate the proposed multi-view mesh generation method on ShapeNet, where we obtain a significant improvement with 34% decrease in chamfer distance to ground truth and 14% increase in the F1-score compared with the state-of-the-art multi-view shape generation method.

85.Long-Term Face Tracking for Crowded Video-Surveillance Scenarios ⬇️

Most current multi-object trackers focus on short-term tracking, and are based on deep and complex systems that do not operate in real-time, often making them impractical for video-surveillance. In this paper, we present a long-term multi-face tracking architecture conceived for working in crowded contexts, particularly unconstrained in terms of movement and occlusions, and where the face is often the only visible part of the person. Our system benefits from advances in the fields of face detection and face recognition to achieve long-term tracking. It follows a tracking-by-detection approach, combining a fast short-term visual tracker with a novel online tracklet reconnection strategy grounded on face verification. Additionally, a correction module is included to correct past track assignments with no extra computational cost. We present a series of experiments introducing novel, specialized metrics for the evaluation of long-term tracking capabilities and a video dataset that we publicly release. Findings demonstrate that, in this context, our approach allows to obtain up to 50% longer tracks than state-of-the-art deep learning trackers.

86.Active Domain Adaptation via Clustering Uncertainty-weighted Embeddings ⬇️

Generalizing deep neural networks to new target domains is critical to their real-world utility. In practice, it may be feasible to get some target data labeled, but to be cost-effective it is desirable to select a maximally-informative subset via active learning (AL). We study this problem of AL under a domain shift. We empirically demonstrate how existing AL approaches based solely on model uncertainty or representative sampling are suboptimal for active domain adaptation. Our algorithm, Active Domain Adaptation via CLustering Uncertainty-weighted Embeddings (ADA-CLUE), i) identifies diverse datapoints for labeling that are both uncertain under the model and representative of unlabeled target data, and ii) leverages the available source and target data for adaptation by optimizing a semi-supervised adversarial entropy loss that is complementary to our active sampling objective. On standard image classification benchmarks for domain adaptation, ADA-CLUE consistently performs as well or better than competing active adaptation, active learning, and domain adaptation methods across shift severities, model initializations, and labeling budgets.

87.Generalized Intersection Algorithms with Fixpoints for Image Decomposition Learning ⬇️

In image processing, classical methods minimize a suitable functional that balances between computational feasibility (convexity of the functional is ideal) and suitable penalties reflecting the desired image decomposition. The fact that algorithms derived from such minimization problems can be used to construct (deep) learning architectures has spurred the development of algorithms that can be trained for a specifically desired image decomposition, e.g. into cartoon and texture. While many such methods are very successful, theoretical guarantees are only scarcely available. To this end, in this contribution, we formalize a general class of intersection point problems encompassing a wide range of (learned) image decomposition models, and we give an existence result for a large subclass of such problems, i.e. giving the existence of a fixpoint of the corresponding algorithm. This class generalizes classical model-based variational problems, such as the TV-l2 -model or the more general TV-Hilbert model. To illustrate the potential for learned algorithms, novel (non learned) choices within our class show comparable results in denoising and texture removal.

88.Ensembling Low Precision Models for Binary Biomedical Image Segmentation ⬇️

Segmentation of anatomical regions of interest such as vessels or small lesions in medical images is still a difficult problem that is often tackled with manual input by an expert. One of the major challenges for this task is that the appearance of foreground (positive) regions can be similar to background (negative) regions. As a result, many automatic segmentation algorithms tend to exhibit asymmetric errors, typically producing more false positives than false negatives. In this paper, we aim to leverage this asymmetry and train a diverse ensemble of models with very high recall, while sacrificing their precision. Our core idea is straightforward: A diverse ensemble of low precision and high recall models are likely to make different false positive errors (classifying background as foreground in different parts of the image), but the true positives will tend to be consistent. Thus, in aggregate the false positive errors will cancel out, yielding high performance for the ensemble. Our strategy is general and can be applied with any segmentation model. In three different applications (carotid artery segmentation in a neck CT angiography, myocardium segmentation in a cardiovascular MRI and multiple sclerosis lesion segmentation in a brain MRI), we show how the proposed approach can significantly boost the performance of a baseline segmentation method.

89.Zoom-CAM: Generating Fine-grained Pixel Annotations from Image Labels ⬇️

Current weakly supervised object localization and segmentation rely on class-discriminative visualization techniques to generate pseudo-labels for pixel-level training. Such visualization methods, including class activation mapping (CAM) and Grad-CAM, use only the deepest, lowest resolution convolutional layer, missing all information in intermediate layers. We propose Zoom-CAM: going beyond the last lowest resolution layer by integrating the importance maps over all activations in intermediate layers. Zoom-CAM captures fine-grained small-scale objects for various discriminative class instances, which are commonly missed by the baseline visualization methods. We focus on generating pixel-level pseudo-labels from class labels. The quality of our pseudo-labels evaluated on the ImageNet localization task exhibits more than 2.8% improvement on top-1 error. For weakly supervised semantic segmentation our generated pseudo-labels improve a state of the art model by 1.1%.

90.A Survey of Machine Learning Techniques in Adversarial Image Forensics ⬇️

Image forensic plays a crucial role in both criminal investigations (e.g., dissemination of fake images to spread racial hate or false narratives about specific ethnicity groups) and civil litigation (e.g., defamation). Increasingly, machine learning approaches are also utilized in image forensics. However, there are also a number of limitations and vulnerabilities associated with machine learning-based approaches, for example how to detect adversarial (image) examples, with real-world consequences (e.g., inadmissible evidence, or wrongful conviction). Therefore, with a focus on image forensics, this paper surveys techniques that can be used to enhance the robustness of machine learning-based binary manipulation detectors in various adversarial scenarios.

91.RobustBench: a standardized adversarial robustness benchmark ⬇️

Evaluation of adversarial robustness is often error-prone leading to overestimation of the true robustness of models. While adaptive attacks designed for a particular defense are a way out of this, there are only approximate guidelines on how to perform them. Moreover, adaptive evaluations are highly customized for particular models, which makes it difficult to compare different defenses. Our goal is to establish a standardized benchmark of adversarial robustness, which as accurately as possible reflects the robustness of the considered models within a reasonable computational budget. This requires to impose some restrictions on the admitted models to rule out defenses that only make gradient-based attacks ineffective without improving actual robustness. We evaluate robustness of models for our benchmark with AutoAttack, an ensemble of white- and black-box attacks which was recently shown in a large-scale study to improve almost all robustness evaluations compared to the original publications. Our leaderboard, hosted at this http URL, aims at reflecting the current state of the art on a set of well-defined tasks in $\ell_\infty$- and $\ell_2$-threat models with possible extensions in the future. Additionally, we open-source the library this http URL that provides unified access to state-of-the-art robust models to facilitate their downstream applications. Finally, based on the collected models, we analyze general trends in $\ell_p$-robustness and its impact on other tasks such as robustness to various distribution shifts and out-of-distribution detection.

92.Agent-based Simulation Model and Deep Learning Techniques to Evaluate and Predict Transportation Trends around COVID-19 ⬇️

The COVID-19 pandemic has affected travel behaviors and transportation system operations, and cities are grappling with what policies can be effective for a phased reopening shaped by social distancing. This edition of the white paper updates travel trends and highlights an agent-based simulation model's results to predict the impact of proposed phased reopening strategies. It also introduces a real-time video processing method to measure social distancing through cameras on city streets.

93.Optimism in the Face of Adversity: Understanding and Improving Deep Learning through Adversarial Robustness ⬇️

Driven by massive amounts of data and important advances in computational resources, new deep learning systems have achieved outstanding results in a large spectrum of applications. Nevertheless, our current theoretical understanding on the mathematical foundations of deep learning lags far behind its empirical success. Towards solving the vulnerability of neural networks, however, the field of adversarial robustness has recently become one of the main sources of explanations of our deep models. In this article, we provide an in-depth review of the field of adversarial robustness in deep learning, and give a self-contained introduction to its main notions. But, in contrast to the mainstream pessimistic perspective of adversarial robustness, we focus on the main positive aspects that it entails. We highlight the intuitive connection between adversarial examples and the geometry of deep neural networks, and eventually explore how the geometric study of adversarial examples can serve as a powerful tool to understand deep learning. Furthermore, we demonstrate the broad applicability of adversarial robustness, providing an overview of the main emerging applications of adversarial robustness beyond security. The goal of this article is to provide readers with a set of new perspectives to understand deep learning, and to supply them with intuitive tools and insights on how to use adversarial robustness to improve it.

94.Practical Frank-Wolfe algorithms ⬇️

In the last decade there has been a resurgence of interest in Frank-Wolfe (FW) style methods for optimizing a smooth convex function over a polytope. Examples of recently developed techniques include {\em Decomposition-invariant Conditional Gradient} (DiCG), {\em Blended Condition Gradient} (BCG), and {\em Frank-Wolfe with in-face directions} (IF-FW) methods. We introduce two extensions of these techniques. First, we augment DiCG with the {\em working set} strategy, and show how to optimize over the working set using {\em shadow simplex steps}. Second, we generalize in-face Frank-Wolfe directions to polytopes in which faces cannot be efficiently computed, and also describe a generic recursive procedure that can be used in conjunction with several FW-style techniques. Experimental results indicate that these extensions are capable of speeding up original algorithms by orders of magnitude for certain applications.

95.GASNet: Weakly-supervised Framework for COVID-19 Lesion Segmentation ⬇️

Segmentation of infected areas in chest CT volumes is of great significance for further diagnosis and treatment of COVID-19 patients. Due to the complex shapes and varied appearances of lesions, a large number of voxel-level labeled samples are generally required to train a lesion segmentation network, which is a main bottleneck for developing deep learning based medical image segmentation algorithms. In this paper, we propose a weakly-supervised lesion segmentation framework by embedding the Generative Adversarial training process into the Segmentation Network, which is called GASNet. GASNet is optimized to segment the lesion areas of a COVID-19 CT by the segmenter, and to replace the abnormal appearance with a generated normal appearance by the generator, so that the restored CT volumes are indistinguishable from healthy CT volumes by the discriminator. GASNet is supervised by chest CT volumes of many healthy and COVID-19 subjects without voxel-level annotations. Experiments on three public databases show that when using as few as one voxel-level labeled sample, the performance of GASNet is comparable to fully-supervised segmentation algorithms trained on dozens of voxel-level labeled samples.

96.Measuring breathing induced oesophageal motion and its dosimetric impact ⬇️

Stereotactic body radiation therapy allows for a precise and accurate dose delivery. Organ motion during treatment bares the risk of undetected high dose healthy tissue exposure. An organ very susceptible to high dose is the oesophagus. Its low contrast on CT and the oblong shape renders motion estimation difficult. We tackle this issue by modern algorithms to measure the oesophageal motion voxel-wise and to estimate motion related dosimetric impact. Oesophageal motion was measured using deformable image registration and 4DCT of 11 internal and 5 public datasets. Current clinical practice of contouring the organ on 3DCT was compared to timely resolved 4DCT contours. The dosimetric impact of the motion was estimated by analysing the trajectory of each voxel in the 4D dose distribution. Finally an organ motion model was built, allowing for easier patient-wise comparisons. Motion analysis showed mean absolute maximal motion amplitudes of 4.24 +/- 2.71 mm left-right, 4.81 +/- 2.58 mm anterior-posterior and 10.21 +/- 5.13 mm superior-inferior. Motion between the cohorts differed significantly. In around 50 % of the cases the dosimetric passing criteria was violated. Contours created on 3DCT did not cover 14 % of the organ for 50 % of the respiratory cycle and the 3D contour is around 38 % smaller than the union of all 4D contours. The motion model revealed that the maximal motion is not limited to the lower part of the organ. Our results showed motion amplitudes higher than most reported values in the literature and that motion is very heterogeneous across patients. Therefore, individual motion information should be considered in contouring and planning.

97.Semantic Histogram Based Graph Matching for Real-Time Multi-Robot Global Localization in Large Scale Environment ⬇️

The core problem of visual multi-robot simultaneous localization and mapping (MR-SLAM) is how to efficiently and accurately perform multi-robot global localization (MR-GL). The difficulties are two-fold. The first is the difficulty of global localization for significant viewpoint difference. Appearance-based localization methods tend to fail under large viewpoint changes. Recently, semantic graphs have been utilized to overcome the viewpoint variation problem. However, the methods are highly time-consuming, especially in large-scale environments. This leads to the second difficulty, which is how to perform real-time global localization. In this paper, we propose a semantic histogram-based graph matching method that is robust to viewpoint variation and can achieve real-time global localization. Based on that, we develop a system that can accurately and efficiently perform MR-GL for both homogeneous and heterogeneous robots. The experimental results show that our approach is about 30 times faster than Random Walk based semantic descriptors. Moreover, it achieves an accuracy of 95% for global localization, while the accuracy of the state-of-the-art method is 85%.

98.Meta-learning the Learning Trends Shared Across Tasks ⬇️

Meta-learning stands for 'learning to learn' such that generalization to new tasks is achieved. Among these methods, Gradient-based meta-learning algorithms are a specific sub-class that excel at quick adaptation to new tasks with limited data. This demonstrates their ability to acquire transferable knowledge, a capability that is central to human learning. However, the existing meta-learning approaches only depend on the current task information during the adaptation, and do not share the meta-knowledge of how a similar task has been adapted before. To address this gap, we propose a 'Path-aware' model-agnostic meta-learning approach. Specifically, our approach not only learns a good initialization for adaptation, it also learns an optimal way to adapt these parameters to a set of task-specific parameters, with learnable update directions, learning rates and, most importantly, the way updates evolve over different time-steps. Compared to the existing meta-learning methods, our approach offers: (a) The ability to learn gradient-preconditioning at different time-steps of the inner-loop, thereby modeling the dynamic learning behavior shared across tasks, and (b) The capability of aggregating the learning context through the provision of direct gradient-skip connections from the old time-steps, thus avoiding overfitting and improving generalization. In essence, our approach not only learns a transferable initialization, but also models the optimal update directions, learning rates, and task-specific learning trends. Specifically, in terms of learning trends, our approach determines the way update directions shape up as the task-specific learning progresses and how the previous update history helps in the current update. Our approach is simple to implement and demonstrates faster convergence. We report significant performance improvements on a number of FSL datasets.

99.MimicNorm: Weight Mean and Last BN Layer Mimic the Dynamic of Batch Normalization ⬇️

Substantial experiments have validated the success of Batch Normalization (BN) Layer in benefiting convergence and generalization. However, BN requires extra memory and float-point calculation. Moreover, BN would be inaccurate on micro-batch, as it depends on batch statistics. In this paper, we address these problems by simplifying BN regularization while keeping two fundamental impacts of BN layers, i.e., data decorrelation and adaptive learning rate. We propose a novel normalization method, named MimicNorm, to improve the convergence and efficiency in network training. MimicNorm consists of only two light operations, including modified weight mean operations (subtract mean values from weight parameter tensor) and one BN layer before loss function (last BN layer). We leverage the neural tangent kernel (NTK) theory to prove that our weight mean operation whitens activations and transits network into the chaotic regime like BN layer, and consequently, leads to an enhanced convergence. The last BN layer provides autotuned learning rates and also improves accuracy. Experimental results show that MimicNorm achieves similar accuracy for various network structures, including ResNets and lightweight networks like ShuffleNet, with a reduction of about 20% memory consumption. The code is publicly available at this https URL.

100.Evidential Sparsification of Multimodal Latent Spaces in Conditional Variational Autoencoders ⬇️

Discrete latent spaces in variational autoencoders have been shown to effectively capture the data distribution for many real-world problems such as natural language understanding, human intent prediction, and visual scene representation. However, discrete latent spaces need to be sufficiently large to capture the complexities of real-world data, rendering downstream tasks computationally challenging. For instance, performing motion planning in a high-dimensional latent representation of the environment could be intractable. We consider the problem of sparsifying the discrete latent space of a trained conditional variational autoencoder, while preserving its learned multimodality. As a post hoc latent space reduction technique, we use evidential theory to identify the latent classes that receive direct evidence from a particular input condition and filter out those that do not. Experiments on diverse tasks, such as image generation and human behavior prediction, demonstrate the effectiveness of our proposed technique at reducing the discrete latent sample space size of a model while maintaining its learned multimodality.

101.Unsupervised Foveal Vision Neural Networks with Top-Down Attention ⬇️

Deep learning architectures are an extremely powerful tool for recognizing and classifying images. However, they require supervised learning and normally work on vectors the size of image pixels and produce the best results when trained on millions of object images. To help mitigate these issues, we propose the fusion of bottom-up saliency and top-down attention employing only unsupervised learning techniques, which helps the object recognition module to focus on relevant data and learn important features that can later be fine-tuned for a specific task. In addition, by utilizing only relevant portions of the data, the training speed can be greatly improved. We test the performance of the proposed Gamma saliency technique on the Toronto and CAT2000 databases, and the foveated vision in the Street View House Numbers (SVHN) database. The results in foveated vision show that Gamma saliency is comparable to the best and computationally faster. The results in SVHN show that our unsupervised cognitive architecture is comparable to fully supervised methods and that the Gamma saliency also improves CNN performance if desired. We also develop a topdown attention mechanism based on the Gamma saliency applied to the top layer of CNNs to improve scene understanding in multi-object images or images with strong background clutter. When we compare the results with human observers in an image dataset of animals occluded in natural scenes, we show that topdown attention is capable of disambiguating object from background and improves system performance beyond the level of human observers.

102.Feature Importance Ranking for Deep Learning ⬇️

Feature importance ranking has become a powerful tool for explainable AI. However, its nature of combinatorial optimization poses a great challenge for deep learning. In this paper, we propose a novel dual-net architecture consisting of operator and selector for discovery of an optimal feature subset of a fixed size and ranking the importance of those features in the optimal subset simultaneously. During learning, the operator is trained for a supervised learning task via optimal feature subset candidates generated by the selector that learns predicting the learning performance of the operator working on different optimal subset candidates. We develop an alternate learning algorithm that trains two nets jointly and incorporates a stochastic local search procedure into learning to address the combinatorial optimization challenge. In deployment, the selector generates an optimal feature subset and ranks feature importance, while the operator makes predictions based on the optimal subset for test data. A thorough evaluation on synthetic, benchmark and real data sets suggests that our approach outperforms several state-of-the-art feature importance ranking and supervised feature selection methods. (Our source code is available: this https URL)

103.Shape Constrained CNN for Cardiac MR Segmentation with Simultaneous Prediction of Shape and Pose Parameters ⬇️

Semantic segmentation using convolutional neural networks (CNNs) is the state-of-the-art for many medical segmentation tasks including left ventricle (LV) segmentation in cardiac MR images. However, a drawback is that these CNNs lack explicit shape constraints, occasionally resulting in unrealistic segmentations. In this paper, we perform LV and myocardial segmentation by regression of pose and shape parameters derived from a statistical shape model. The integrated shape model regularizes predicted segmentations and guarantees realistic shapes. Furthermore, in contrast to semantic segmentation, it allows direct calculation of regional measures such as myocardial thickness. We enforce robustness of shape and pose prediction by simultaneously constructing a segmentation distance map during training. We evaluated the proposed method in a fivefold cross validation on a in-house clinical dataset with 75 subjects containing a total of 1539 delineated short-axis slices covering LV from apex to base, and achieved a correlation of 99% for LV area, 94% for myocardial area, 98% for LV dimensions and 88% for regional wall thicknesses. The method was additionally validated on the LVQuan18 and LVQuan19 public datasets and achieved state-of-the-art results.

104.Light Stage Super-Resolution: Continuous High-Frequency Relighting ⬇️

The light stage has been widely used in computer graphics for the past two decades, primarily to enable the relighting of human faces. By capturing the appearance of the human subject under different light sources, one obtains the light transport matrix of that subject, which enables image-based relighting in novel environments. However, due to the finite number of lights in the stage, the light transport matrix only represents a sparse sampling on the entire sphere. As a consequence, relighting the subject with a point light or a directional source that does not coincide exactly with one of the lights in the stage requires interpolation and resampling the images corresponding to nearby lights, and this leads to ghosting shadows, aliased specularities, and other artifacts. To ameliorate these artifacts and produce better results under arbitrary high-frequency lighting, this paper proposes a learning-based solution for the "super-resolution" of scans of human faces taken from a light stage. Given an arbitrary "query" light direction, our method aggregates the captured images corresponding to neighboring lights in the stage, and uses a neural network to synthesize a rendering of the face that appears to be illuminated by a "virtual" light source at the query location. This neural network must circumvent the inherent aliasing and regularity of the light stage data that was used for training, which we accomplish through the use of regularized traditional interpolation methods within our network. Our learned model is able to produce renderings for arbitrary light directions that exhibit realistic shadows and specular highlights, and is able to generalize across a wide variety of subjects.

105.Variational Dynamic for Self-Supervised Exploration in Deep Reinforcement Learning ⬇️

Efficient exploration remains a challenging problem in reinforcement learning, especially for tasks where extrinsic rewards from environments are sparse or even totally disregarded. Significant advances based on intrinsic motivation show promising results in simple environments but often get stuck in environments with multimodal and stochastic dynamics. In this work, we propose a variational dynamic model based on the conditional variational inference to model the multimodality and stochasticity. We consider the environmental state-action transition as a conditional generative process by generating the next-state prediction under the condition of the current state, action, and latent variable. We derive an upper bound of the negative log-likelihood of the environmental transition and use such an upper bound as the intrinsic reward for exploration, which allows the agent to learn skills by self-supervised exploration without observing extrinsic rewards. We evaluate the proposed method on several image-based simulation tasks and a real robotic manipulating task. Our method outperforms several state-of-the-art environment model-based exploration approaches.

106.A Corpus for English-Japanese Multimodal Neural Machine Translation with Comparable Sentences ⬇️

Multimodal neural machine translation (NMT) has become an increasingly important area of research over the years because additional modalities, such as image data, can provide more context to textual data. Furthermore, the viability of training multimodal NMT models without a large parallel corpus continues to be investigated due to low availability of parallel sentences with images, particularly for English-Japanese data. However, this void can be filled with comparable sentences that contain bilingual terms and parallel phrases, which are naturally created through media such as social network posts and e-commerce product descriptions. In this paper, we propose a new multimodal English-Japanese corpus with comparable sentences that are compiled from existing image captioning datasets. In addition, we supplement our comparable sentences with a smaller parallel corpus for validation and test purposes. To test the performance of this comparable sentence translation scenario, we train several baseline NMT models with our comparable corpus and evaluate their English-Japanese translation performance. Due to low translation scores in our baseline experiments, we believe that current multimodal NMT models are not designed to effectively utilize comparable sentence data. Despite this, we hope for our corpus to be used to further research into multimodal NMT with comparable sentences.

107.Class-incremental Learning with Pre-allocated Fixed Classifiers ⬇️

In class-incremental learning, a learning agent faces a stream of data with the goal of learning new classes while not forgetting previous ones. Neural networks are known to suffer under this setting, as they forget previously acquired knowledge. To address this problem, effective methods exploit past data stored in an episodic memory while expanding the final classifier nodes to accommodate the new classes.
In this work, we substitute the expanding classifier with a novel fixed classifier in which a number of pre-allocated output nodes are subject to the classification loss right from the beginning of the learning phase. Contrarily to the standard expanding classifier, this allows: (a) the output nodes of future unseen classes to firstly see negative samples since the beginning of learning together with the positive samples that incrementally arrive; (b) to learn features that do not change their geometric configuration as novel classes are incorporated in the learning model.
Experiments with public datasets show that the proposed approach is as effective as the expanding classifier while exhibiting novel intriguing properties of the internal feature representation that are otherwise not-existent. Our ablation study on pre-allocating a large number of classes further validates the approach.

108.A general approach to compute the relevance of middle-level input features ⬇️

This work proposes a novel general framework, in the context of eXplainable Artificial Intelligence (XAI), to construct explanations for the behaviour of Machine Learning (ML) models in terms of middle-level features. One can isolate two different ways to provide explanations in the context of XAI: low and middle-level explanations. Middle-level explanations have been introduced for alleviating some deficiencies of low-level explanations such as, in the context of image classification, the fact that human users are left with a significant interpretive burden: starting from low-level explanations, one has to identify properties of the overall input that are perceptually salient for the human visual system. However, a general approach to correctly evaluate the elements of middle-level explanations with respect ML model responses has never been proposed in the literature.

109.CT Image Segmentation for Inflamed and Fibrotic Lungs Using a Multi-Resolution Convolutional Neural Network ⬇️

The purpose of this study was to develop a fully-automated segmentation algorithm, robust to various density enhancing lung abnormalities, to facilitate rapid quantitative analysis of computed tomography images. A polymorphic training approach is proposed, in which both specifically labeled left and right lungs of humans with COPD, and nonspecifically labeled lungs of animals with acute lung injury, were incorporated into training a single neural network. The resulting network is intended for predicting left and right lung regions in humans with or without diffuse opacification and consolidation. Performance of the proposed lung segmentation algorithm was extensively evaluated on CT scans of subjects with COPD, confirmed COVID-19, lung cancer, and IPF, despite no labeled training data of the latter three diseases. Lobar segmentations were obtained using the left and right lung segmentation as input to the LobeNet algorithm. Regional lobar analysis was performed using hierarchical clustering to identify radiographic subtypes of COVID-19. The proposed lung segmentation algorithm was quantitatively evaluated using semi-automated and manually-corrected segmentations in 87 COVID-19 CT images, achieving an average symmetric surface distance of $0.495 \pm 0.309$ mm and Dice coefficient of $0.985 \pm 0.011$. Hierarchical clustering identified four radiographical phenotypes of COVID-19 based on lobar fractions of consolidated and poorly aerated tissue. Lower left and lower right lobes were consistently more afflicted with poor aeration and consolidation. However, the most severe cases demonstrated involvement of all lobes. The polymorphic training approach was able to accurately segment COVID-19 cases with diffuse consolidation without requiring COVID-19 cases for training.