Skip to content

Latest commit

 

History

History
265 lines (265 loc) · 182 KB

20201208.md

File metadata and controls

265 lines (265 loc) · 182 KB

ArXiv cs.CV --Tue, 8 Dec 2020

1.Identity-Driven DeepFake Detection ⬇️

DeepFake detection has so far been dominated by artifact-driven'' methods and the detection performance significantly degrades when either the type of image artifacts is unknown or the artifacts are simply too hard to find. In this work, we present an alternative approach: Identity-Driven DeepFake Detection. Our approach takes as input the suspect image/video as well as the target identity information (a reference image or video). We output a decision on whether the identity in the suspect image/video is the same as the target identity. Our motivation is to prevent the most common and harmful DeepFakes that spread false information of a targeted person. The identity-based approach is fundamentally different in that it does not attempt to detect image artifacts. Instead, it focuses on whether the identity in the suspect image/video is true. To facilitate research on identity-based detection, we present a new large scale dataset Vox-DeepFake", in which each suspect content is associated with multiple reference images collected from videos of a target identity. We also present a simple identity-based detection algorithm called the OuterFace, which may serve as a baseline for further research. Even trained without fake videos, the OuterFace algorithm achieves superior detection accuracy and generalizes well to different DeepFake methods, and is robust with respect to video degradation techniques -- a performance not achievable with existing detection algorithms.

2.NeRV: Neural Reflectance and Visibility Fields for Relighting and View Synthesis ⬇️

We present a method that takes as input a set of images of a scene illuminated by unconstrained known lighting, and produces as output a 3D representation that can be rendered from novel viewpoints under arbitrary lighting conditions. Our method represents the scene as a continuous volumetric function parameterized as MLPs whose inputs are a 3D location and whose outputs are the following scene properties at that input location: volume density, surface normal, material parameters, distance to the first surface intersection in any direction, and visibility of the external environment in any direction. Together, these allow us to render novel views of the object under arbitrary lighting, including indirect illumination effects. The predicted visibility and surface intersection fields are critical to our model's ability to simulate direct and indirect illumination during training, because the brute-force techniques used by prior work are intractable for lighting conditions outside of controlled setups with a single light. Our method outperforms alternative approaches for recovering relightable 3D scene representations, and performs well in complex lighting settings that have posed a significant challenge to prior work.

3.NeRD: Neural Reflectance Decomposition from Image Collections ⬇️

Decomposing a scene into its shape, reflectance, and illumination is a challenging but essential problem in computer vision and graphics. This problem is inherently more challenging when the illumination is not a single light source under laboratory conditions but is instead an unconstrained environmental illumination. Though recent work has shown that implicit representations can be used to model the radiance field of an object, these techniques only enable view synthesis and not relighting. Additionally, evaluating these radiance fields is resource and time-intensive. By decomposing a scene into explicit representations, any rendering framework can be leveraged to generate novel views under any illumination in real-time. NeRD is a method that achieves this decomposition by introducing physically-based rendering to neural radiance fields. Even challenging non-Lambertian reflectances, complex geometry, and unknown illumination can be decomposed to high-quality models.

4.MultiON: Benchmarking Semantic Map Memory using Multi-Object Navigation ⬇️

Navigation tasks in photorealistic 3D environments are challenging because they require perception and effective planning under partial observability. Recent work shows that map-like memory is useful for long-horizon navigation tasks. However, a focused investigation of the impact of maps on navigation tasks of varying complexity has not yet been performed. We propose the multiON task, which requires navigation to an episode-specific sequence of objects in a realistic environment. MultiON generalizes the ObjectGoal navigation task and explicitly tests the ability of navigation agents to locate previously observed goal objects. We perform a set of multiON experiments to examine how a variety of agent models perform across a spectrum of navigation task complexities. Our experiments show that: i) navigation performance degrades dramatically with escalating task complexity; ii) a simple semantic map agent performs surprisingly well relative to more complex neural image feature map agents; and iii) even oracle map agents achieve relatively low performance, indicating the potential for future work in training embodied navigation agents using maps. Video summary: this https URL

5.Learning Video Instance Segmentation with Recurrent Graph Neural Networks ⬇️

Most existing approaches to video instance segmentation comprise multiple modules that are heuristically combined to produce the final output. Formulating a purely learning-based method instead, which models both the temporal aspect as well as a generic track management required to solve the video instance segmentation task, is a highly challenging problem. In this work, we propose a novel learning formulation, where the entire video instance segmentation problem is modelled jointly. We fit a flexible model to our formulation that, with the help of a graph neural network, processes all available new information in each frame. Past information is considered and processed via a recurrent connection. We demonstrate the effectiveness of the proposed approach in comprehensive experiments. Our approach, operating at over 25 FPS, outperforms previous video real-time methods. We further conduct detailed ablative experiments that validate the different aspects of our approach.

6.Model Compression Using Optimal Transport ⬇️

Model compression methods are important to allow for easier deployment of deep learning models in compute, memory and energy-constrained environments such as mobile phones. Knowledge distillation is a class of model compression algorithm where knowledge from a large teacher network is transferred to a smaller student network thereby improving the student's performance. In this paper, we show how optimal transport-based loss functions can be used for training a student network which encourages learning student network parameters that help bring the distribution of student features closer to that of the teacher features. We present image classification results on CIFAR-100, SVHN and ImageNet and show that the proposed optimal transport loss functions perform comparably to or better than other loss functions.

7.IHashNet: Iris Hashing Network based on efficient multi-index hashing ⬇️

Massive biometric deployments are pervasive in today's world. But despite the high accuracy of biometric systems, their computational efficiency degrades drastically with an increase in the database size. Thus, it is essential to index them. An ideal indexing scheme needs to generate codes that preserve the intra-subject similarity as well as inter-subject dissimilarity. Here, in this paper, we propose an iris indexing scheme using real-valued deep iris features binarized to iris bar codes (IBC) compatible with the indexing structure. Firstly, for extracting robust iris features, we have designed a network utilizing the domain knowledge of ordinal filtering and learning their nonlinear combinations. Later these real-valued features are binarized. Finally, for indexing the iris dataset, we have proposed a loss that can transform the binary feature into an improved feature compatible with the Multi-Index Hashing scheme. This loss function ensures the hamming distance equally distributed among all the contiguous disjoint sub-strings. To the best of our knowledge, this is the first work in the iris indexing domain that presents an end-to-end iris indexing structure. Experimental results on four datasets are presented to depict the efficacy of the proposed approach.

8.Traffic flow prediction using Deep Sedenion Networks ⬇️

Traffic4cast2020 is the second year of NeurIPS competition where participants are to predict traffic parameters (speed and volume) for 3 different cities. The information provided includes multi-channeled multiple time steps inputs and predicted output. Multiple output time steps were viewed as a multi-task segmentation in this work, forming the basis for applying hypercomplex number based UNet structure. The presence of 12 spatio-temporal multi-channel dynamic inputs and single static input, while sedenion numbers are 16-dimensional (16-ions) forms the basis of using sedenion hypercomplex number. We use 12 of the 15 sedenion imaginary parts for the dynamic inputs and the real part is used for the static input. The sedenion network enables the solution of this challenge by using multi-task learning, a situation of the traffic prediction with different time steps to compute.

9.End-to-end Handwritten Paragraph Text Recognition Using a Vertical Attention Network ⬇️

Unconstrained handwritten text recognition remains challenging for computer vision systems. Paragraph text recognition is traditionally achieved by two models: the first one for line segmentation and the second one for text line recognition. We propose a unified end-to-end model using hybrid attention to tackle this task. We achieve state-of-the-art character error rate at line and paragraph levels on three popular datasets: 1.90% for RIMES, 4.32% for IAM and 3.63% for READ 2016. The proposed model can be trained from scratch, without using any segmentation label contrary to the standard approach. Our code and trained model weights are available at this https URL.

10.Correct block-design experiments mitigate temporal correlation bias in EEG classification ⬇️

It is argued in [1] that [2] was able to classify EEG responses to visual stimuli solely because of the temporal correlation that exists in all EEG data and the use of a block design. We here show that the main claim in [1] is drastically overstated and their other analyses are seriously flawed by wrong methodological choices. To validate our counter-claims, we evaluate the performance of state-of-the-art methods on the dataset in [2] reaching about 50% classification accuracy over 40 classes, lower than in [2], but still significant. We then investigate the influence of EEG temporal correlation on classification accuracy by testing the same models in two additional experimental settings: one that replicates [1]'s rapid-design experiment, and another one that examines the data between blocks while subjects are shown a blank screen. In both cases, classification accuracy is at or near chance, in contrast to what [1] reports, indicating a negligible contribution of temporal correlation to classification accuracy. We, instead, are able to replicate the results in [1] only when intentionally contaminating our data by inducing a temporal correlation. This suggests that what Li et al. [1] demonstrate is that their data are strongly contaminated by temporal correlation and low signal-to-noise ratio. We argue that the reason why Li et al. [1] observe such high correlation in EEG data is their unconventional experimental design and settings that violate the basic cognitive neuroscience design recommendations, first and foremost the one of limiting the experiments' duration, as instead done in [2]. Our analyses in this paper refute the claims of the "perils and pitfalls of block-design" in [1]. Finally, we conclude the paper by examining a number of other oversimplistic statements, inconsistencies, misinterpretation of machine learning concepts, speculations and misleading claims in [1].

11.Sparse Fooling Images: Fooling Machine Perception through Unrecognizable Images ⬇️

In recent years, deep neural networks (DNNs) have achieved equivalent or even higher accuracy in various recognition tasks than humans. However, some images exist that lead DNNs to a completely wrong decision, whereas humans never fail with these images. Among others, fooling images are those that are not recognizable as natural objects such as dogs and cats, but DNNs classify these images into classes with high confidence scores. In this paper, we propose a new class of fooling images, sparse fooling images (SFIs), which are single color images with a small number of altered pixels. Unlike existing fooling images, which retain some characteristic features of natural objects, SFIs do not have any local or global features that can be recognizable to humans; however, in machine perception (i.e., by DNN classifiers), SFIs are recognizable as natural objects and classified to certain classes with high confidence scores. We propose two methods to generate SFIs for different settings~(semiblack-box and white-box). We also experimentally demonstrate the vulnerability of DNNs through out-of-distribution detection and compare three architectures in terms of the robustness against SFIs. This study gives rise to questions on the structure and robustness of CNNs and discusses the differences between human and machine perception.

12.CycleQSM: Unsupervised QSM Deep Learning using Physics-Informed CycleGAN ⬇️

Quantitative susceptibility mapping (QSM) is a useful magnetic resonance imaging (MRI) technique which provides spatial distribution of magnetic susceptibility values of tissues. QSMs can be obtained by deconvolving the dipole kernel from phase images, but the spectral nulls in the dipole kernel make the inversion ill-posed. In recent times, deep learning approaches have shown a comparable QSM reconstruction performance as the classic approaches, despite the fast reconstruction time. Most of the existing deep learning methods are, however, based on supervised learning, so matched pairs of input phase images and the ground-truth maps are needed. Moreover, it was reported that the supervised learning often leads to underestimated QSM values. To address this, here we propose a novel unsupervised QSM deep learning method using physics-informed cycleGAN, which is derived from optimal transport perspective. In contrast to the conventional cycleGAN, our novel cycleGAN has only one generator and one discriminator thanks to the known dipole kernel. Experimental results confirm that the proposed method provides more accurate QSM maps compared to the existing deep learning approaches, and provide competitive performance to the best classical approaches despite the ultra-fast reconstruction.

13.Self-supervised asymmetric deep hashing with margin-scalable constraint for image retrieval ⬇️

Due to its validity and rapidity, image retrieval based on deep hashing approaches is widely concerned especially in large-scale visual search. However, many existing deep hashing methods inadequately utilize label information as guidance of feature learning network without more advanced exploration in semantic space, besides the similarity correlations in hamming space are not fully discovered and embedded into hash codes, by which the retrieval quality is diminished with inefficient preservation of pairwise correlations and multi-label semantics. To cope with these problems, we propose a novel self-supervised asymmetric deep hashing with margin-scalable constraint(SADH) approach for image retrieval. SADH implements a self-supervised network to preserve supreme semantic information in a semantic feature map and a semantic code map for each semantics of the given dataset, which efficiently-and-precisely guides a feature learning network to preserve multi-label semantic information with asymmetric learning strategy. Moreover, for the feature learning part, by further exploiting semantic maps, a new margin-scalable constraint is employed for both highly-accurate construction of pairwise correlation in the hamming space and more discriminative hash code representation. Extensive empirical research on three benchmark datasets validate that the proposed method outperforms several state-of-the-art approaches.

14.Pose-Guided Human Animation from a Single Image in the Wild ⬇️

We present a new pose transfer method for synthesizing a human animation from a single image of a person controlled by a sequence of body poses. Existing pose transfer methods exhibit significant visual artifacts when applying to a novel scene, resulting in temporal inconsistency and failures in preserving the identity and textures of the person. To address these limitations, we design a compositional neural network that predicts the silhouette, garment labels, and textures. Each modular network is explicitly dedicated to a subtask that can be learned from the synthetic data. At the inference time, we utilize the trained network to produce a unified representation of appearance and its labels in UV coordinates, which remains constant across poses. The unified representation provides an incomplete yet strong guidance to generating the appearance in response to the pose change. We use the trained network to complete the appearance and render it with the background. With these strategies, we are able to synthesize human animations that can preserve the identity and appearance of the person in a temporally coherent way without any fine-tuning of the network on the testing scene. Experiments show that our method outperforms the state-of-the-arts in terms of synthesis quality, temporal coherence, and generalization ability.

15.Matching Distributions via Optimal Transport for Semi-Supervised Learning ⬇️

Semi-Supervised Learning (SSL) approaches have been an influential framework for the usage of unlabeled data when there is not a sufficient amount of labeled data available over the course of training. SSL methods based on Convolutional Neural Networks (CNNs) have recently provided successful results on standard benchmark tasks such as image classification. In this work, we consider the general setting of SSL problem where the labeled and unlabeled data come from the same underlying probability distribution. We propose a new approach that adopts an Optimal Transport (OT) technique serving as a metric of similarity between discrete empirical probability measures to provide pseudo-labels for the unlabeled data, which can then be used in conjunction with the initial labeled data to train the CNN model in an SSL manner. We have evaluated and compared our proposed method with state-of-the-art SSL algorithms on standard datasets to demonstrate the superiority and effectiveness of our SSL algorithm.

16.Sparse Single Sweep LiDAR Point Cloud Segmentation via Learning Contextual Shape Priors from Scene Completion ⬇️

LiDAR point cloud analysis is a core task for 3D computer vision, especially for autonomous driving. However, due to the severe sparsity and noise interference in the single sweep LiDAR point cloud, the accurate semantic segmentation is non-trivial to achieve. In this paper, we propose a novel sparse LiDAR point cloud semantic segmentation framework assisted by learned contextual shape priors. In practice, an initial semantic segmentation (SS) of a single sweep point cloud can be achieved by any appealing network and then flows into the semantic scene completion (SSC) module as the input. By merging multiple frames in the LiDAR sequence as supervision, the optimized SSC module has learned the contextual shape priors from sequential LiDAR data, completing the sparse single sweep point cloud to the dense one. Thus, it inherently improves SS optimization through fully end-to-end training. Besides, a Point-Voxel Interaction (PVI) module is proposed to further enhance the knowledge fusion between SS and SSC tasks, i.e., promoting the interaction of incomplete local geometry of point cloud and complete voxel-wise global structure. Furthermore, the auxiliary SSC and PVI modules can be discarded during inference without extra burden for SS. Extensive experiments confirm that our JS3C-Net achieves superior performance on both SemanticKITTI and SemanticPOSS benchmarks, i.e., 4% and 3% improvement correspondingly.

17.Unsupervised Pre-training for Person Re-identification ⬇️

In this paper, we present a large scale unlabeled person re-identification (Re-ID) dataset "LUPerson" and make the first attempt of performing unsupervised pre-training for improving the generalization ability of the learned person Re-ID feature representation. This is to address the problem that all existing person Re-ID datasets are all of limited scale due to the costly effort required for data annotation. Previous research tries to leverage models pre-trained on ImageNet to mitigate the shortage of person Re-ID data but suffers from the large domain gap between ImageNet and person Re-ID data. LUPerson is an unlabeled dataset of 4M images of over 200K identities, which is 30X larger than the largest existing Re-ID dataset. It also covers a much diverse range of capturing environments (eg, camera settings, scenes, etc.). Based on this dataset, we systematically study the key factors for learning Re-ID features from two perspectives: data augmentation and contrastive loss. Unsupervised pre-training performed on this large-scale dataset effectively leads to a generic Re-ID feature that can benefit all existing person Re-ID methods. Using our pre-trained model in some basic frameworks, our methods achieve state-of-the-art results without bells and whistles on four widely used Re-ID datasets: CUHK03, Market1501, DukeMTMC, and MSMT17. Our results also show that the performance improvement is more significant on small-scale target datasets or under few-shot setting.

18.Combining Deep Transfer Learning with Signal-image Encoding for Multi-Modal Mental Wellbeing Classification ⬇️

The quantification of emotional states is an important step to understanding wellbeing. Time series data from multiple modalities such as physiological and motion sensor data have proven to be integral for measuring and quantifying emotions. Monitoring emotional trajectories over long periods of time inherits some critical limitations in relation to the size of the training data. This shortcoming may hinder the development of reliable and accurate machine learning models. To address this problem, this paper proposes a framework to tackle the limitation in performing emotional state recognition on multiple multimodal datasets: 1) encoding multivariate time series data into coloured images; 2) leveraging pre-trained object recognition models to apply a Transfer Learning (TL) approach using the images from step 1; 3) utilising a 1D Convolutional Neural Network (CNN) to perform emotion classification from physiological data; 4) concatenating the pre-trained TL model with the 1D CNN. Furthermore, the possibility of performing TL to infer stress from physiological data is explored by initially training a 1D CNN using a large physical activity dataset and then applying the learned knowledge to the target dataset. We demonstrate that model performance when inferring real-world wellbeing rated on a 5-point Likert scale can be enhanced using our framework, resulting in up to 98.5% accuracy, outperforming a conventional CNN by 4.5%. Subject-independent models using the same approach resulted in an average of 72.3% accuracy (SD 0.038). The proposed CNN-TL-based methodology may overcome problems with small training datasets, thus improving on the performance of conventional deep learning methods.

19.An Enriched Automated PV Registry: Combining Image Recognition and 3D Building Data ⬇️

While photovoltaic (PV) systems are installed at an unprecedented rate, reliable information on an installation level remains scarce. As a result, automatically created PV registries are a timely contribution to optimize grid planning and operations. This paper demonstrates how aerial imagery and three-dimensional building data can be combined to create an address-level PV registry, specifying area, tilt, and orientation angles. We demonstrate the benefits of this approach for PV capacity estimation. In addition, this work presents, for the first time, a comparison between automated and officially-created PV registries. Our results indicate that our enriched automated registry proves to be useful to validate, update, and complement official registries.

20.A New Framework for Registration of Semantic Point Clouds from Stereo and RGB-D Cameras ⬇️

This paper reports on a novel nonparametric rigid point cloud registration framework that jointly integrates geometric and semantic measurements such as color or semantic labels into the alignment process and does not require explicit data association. The point clouds are represented as nonparametric functions in a reproducible kernel Hilbert space. The alignment problem is formulated as maximizing the inner product between two functions, essentially a sum of weighted kernels, each of which exploits the local geometric and semantic features. As a result of the continuous models, analytical gradients can be computed, and a local solution can be obtained by optimization over the rigid body transformation group. Besides, we present a new point cloud alignment metric that is intrinsic to the proposed framework and takes into account geometric and semantic information. The evaluations using publicly available stereo and RGB-D datasets show that the proposed method outperforms state-of-the-art outdoor and indoor frame-to-frame registration methods. An open-source GPU implementation is also provided.

21.Generic Semi-Supervised Adversarial Subject Translation for Sensor-Based Human Activity Recognition ⬇️

The performance of Human Activity Recognition (HAR) models, particularly deep neural networks, is highly contingent upon the availability of the massive amount of annotated training data which should be sufficiently labeled. Though, data acquisition and manual annotation in the HAR domain are prohibitively expensive due to skilled human resource requirements in both steps. Hence, domain adaptation techniques have been proposed to adapt the knowledge from the existing source of data. More recently, adversarial transfer learning methods have shown very promising results in image classification, yet limited for sensor-based HAR problems, which are still prone to the unfavorable effects of the imbalanced distribution of samples. This paper presents a novel generic and robust approach for semi-supervised domain adaptation in HAR, which capitalizes on the advantages of the adversarial framework to tackle the shortcomings, by leveraging knowledge from annotated samples exclusively from the source subject and unlabeled ones of the target subject. Extensive subject translation experiments are conducted on three large, middle, and small-size datasets with different levels of imbalance to assess the robustness and effectiveness of the proposed model to the scale as well as imbalance in the data. The results demonstrate the effectiveness of our proposed algorithms over state-of-the-art methods, which led in up to 13%, 4%, and 13% improvement of our high-level activities recognition metrics for Opportunity, LISSI, and PAMAP2 datasets, respectively. The LISSI dataset is the most challenging one owing to its less populated and imbalanced distribution. Compared to the SA-GAN adversarial domain adaptation method, the proposed approach enhances the final classification performance with an average of 7.5% for the three datasets, which emphasizes the effectiveness of micro-mini-batch training.

22.Roof fall hazard detection with convolutional neural networks using transfer learning ⬇️

Roof falls due to geological conditions are major safety hazards in mining and tunneling industries, causing lost work times, injuries, and fatalities. Several large-opening limestone mines in the Eastern and Midwestern United States have roof fall problems caused by high horizontal stresses. The typical hazard management approach for this type of roof fall hazard relies heavily on visual inspections and expert knowledge. In this study, we propose an artificial intelligence (AI) based system for the detection roof fall hazards caused by high horizontal stresses. We use images depicting hazardous and non-hazardous roof conditions to develop a convolutional neural network for autonomous detection of hazardous roof conditions. To compensate for limited input data, we utilize a transfer learning approach. In transfer learning, an already-trained network is used as a starting point for classification in a similar domain. Results confirm that this approach works well for classifying roof conditions as hazardous or safe, achieving a statistical accuracy of 86%. However, accuracy alone is not enough to ensure a reliable hazard management system. System constraints and reliability are improved when the features being used by the network are understood. Therefore, we used a deep learning interpretation technique called integrated gradients to identify the important geologic features in each image for prediction. The analysis of integrated gradients shows that the system mimics expert judgment on roof fall hazard detection. The system developed in this paper demonstrates the potential of deep learning in geological hazard management to complement human experts, and likely to become an essential part of autonomous tunneling operations in those cases where hazard identification heavily depends on expert knowledge.

23.UNOC: Understanding Occlusion for Embodied Presence in Virtual Reality ⬇️

Tracking body and hand motions in the 3D space is essential for social and self-presence in augmented and virtual environments. Unlike the popular 3D pose estimation setting, the problem is often formulated as inside-out tracking based on embodied perception (e.g., egocentric cameras, handheld sensors). In this paper, we propose a new data-driven framework for inside-out body tracking, targeting challenges of omnipresent occlusions in optimization-based methods (e.g., inverse kinematics solvers). We first collect a large-scale motion capture dataset with both body and finger motions using optical markers and inertial sensors. This dataset focuses on social scenarios and captures ground truth poses under self-occlusions and body-hand interactions. We then simulate the occlusion patterns in head-mounted camera views on the captured ground truth using a ray casting algorithm and learn a deep neural network to infer the occluded body parts. In the experiments, we show that our method is able to generate high-fidelity embodied poses by applying the proposed method on the task of real-time inside-out body tracking, finger motion synthesis, and 3-point inverse kinematics.

24.Generating Natural Questions from Images for Multimodal Assistants ⬇️

Generating natural, diverse, and meaningful questions from images is an essential task for multimodal assistants as it confirms whether they have understood the object and scene in the images properly. The research in visual question answering (VQA) and visual question generation (VQG) is a great step. However, this research does not capture questions that a visually-abled person would ask multimodal assistants. Recently published datasets such as KB-VQA, FVQA, and OK-VQA try to collect questions that look for external knowledge which makes them appropriate for multimodal assistants. However, they still contain many obvious and common-sense questions that humans would not usually ask a digital assistant. In this paper, we provide a new benchmark dataset that contains questions generated by human annotators keeping in mind what they would ask multimodal digital assistants. Large scale annotations for several hundred thousand images are expensive and time-consuming, so we also present an effective way of automatically generating questions from unseen images. In this paper, we present an approach for generating diverse and meaningful questions that consider image content and metadata of image (e.g., location, associated keyword). We evaluate our approach using standard evaluation metrics such as BLEU, METEOR, ROUGE, and CIDEr to show the relevance of generated questions with human-provided questions. We also measure the diversity of generated questions using generative strength and inventiveness metrics. We report new state-of-the-art results on the public and our datasets.

25.G-RCN: Optimizing the Gap between Classification and Localization Tasks for Object Detection ⬇️

Multi-task learning is widely used in computer vision. Currently, object detection models utilize shared feature map to complete classification and localization tasks simultaneously. By comparing the performance between the original Faster R-CNN and that with partially separated feature maps, we show that: (1) Sharing high-level features for the classification and localization tasks is sub-optimal; (2) Large stride is beneficial for classification but harmful for localization; (3) Global context information could improve the performance of classification. Based on these findings, we proposed a paradigm called Gap-optimized region based convolutional network (G-RCN), which aims to separating these two tasks and optimizing the gap between them. The paradigm was firstly applied to correct the current ResNet protocol by simply reducing the stride and moving the Conv5 block from the head to the feature extraction network, which brings 3.6 improvement of AP70 on the PASCAL VOC dataset and 1.5 improvement of AP on the COCO dataset for ResNet50. Next, the new method is applied on the Faster R-CNN with backbone of VGG16,ResNet50 and ResNet101, which brings above 2.0 improvement of AP70 on the PASCAL VOC dataset and above 1.9 improvement of AP on the COCO dataset. Noticeably, the implementation of G-RCN only involves a few structural modifications, with no extra module added.

26.w-Net: Dual Supervised Medical Image Segmentation Model with Multi-Dimensional Attention and Cascade Multi-Scale Convolution ⬇️

Deep learning-based medical image segmentation technology aims at automatic recognizing and annotating objects on the medical image. Non-local attention and feature learning by multi-scale methods are widely used to model network, which drives progress in medical image segmentation. However, those attention mechanism methods have weakly non-local receptive fields' strengthened connection for small objects in medical images. Then, the features of important small objects in abstract or coarse feature maps may be deserted, which leads to unsatisfactory performance. Moreover, the existing multi-scale methods only simply focus on different sizes of view, whose sparse multi-scale features collected are not abundant enough for small objects segmentation. In this work, a multi-dimensional attention segmentation model with cascade multi-scale convolution is proposed to predict accurate segmentation for small objects in medical images. As the weight function, multi-dimensional attention modules provide coefficient modification for significant/informative small objects features. Furthermore, The cascade multi-scale convolution modules in each skip-connection path are exploited to capture multi-scale features in different semantic depth. The proposed method is evaluated on three datasets: KiTS19, Pancreas CT of Decathlon-10, and MICCAI 2018 LiTS Challenge, demonstrating better segmentation performances than the state-of-the-art baselines.

27.Confidence-aware Non-repetitive Multimodal Transformers for TextCaps ⬇️

When describing an image, reading text in the visual scene is crucial to understand the key information. Recent work explores the TextCaps task, \emph{i.e.} image captioning with reading Optical Character Recognition (OCR) tokens, which requires models to read text and cover them in generated captions. Existing approaches fail to generate accurate descriptions because of their (1) poor reading ability; (2) inability to choose the crucial words among all extracted OCR tokens; (3) repetition of words in predicted captions. To this end, we propose a Confidence-aware Non-repetitive Multimodal Transformers (CNMT) to tackle the above challenges. Our CNMT consists of a reading, a reasoning and a generation modules, in which Reading Module employs better OCR systems to enhance text reading ability and a confidence embedding to select the most noteworthy tokens. To address the issue of word redundancy in captions, our Generation Module includes a repetition mask to avoid predicting repeated word in captions. Our model outperforms state-of-the-art models on TextCaps dataset, improving from 81.0 to 93.0 in CIDEr. Our source code is publicly available.

28.Ada-Segment: Automated Multi-loss Adaptation for Panoptic Segmentation ⬇️

Panoptic segmentation that unifies instance segmentation and semantic segmentation has recently attracted increasing attention. While most existing methods focus on designing novel architectures, we steer toward a different perspective: performing automated multi-loss adaptation (named Ada-Segment) on the fly to flexibly adjust multiple training losses over the course of training using a controller trained to capture the learning dynamics. This offers a few advantages: it bypasses manual tuning of the sensitive loss combination, a decisive factor for panoptic segmentation; it allows to explicitly model the learning dynamics, and reconcile the learning of multiple objectives (up to ten in our experiments); with an end-to-end architecture, it generalizes to different datasets without the need of re-tuning hyperparameters or re-adjusting the training process laboriously. Our Ada-Segment brings 2.7% panoptic quality (PQ) improvement on COCO val split from the vanilla baseline, achieving the state-of-the-art 48.5% PQ on COCO test-dev split and 32.9% PQ on ADE20K dataset. The extensive ablation studies reveal the ever-changing dynamics throughout the training process, necessitating the incorporation of an automated and adaptive learning strategy as presented in this paper.

29.PSCNet: Pyramidal Scale and Global Context Guided Network for Crowd Counting ⬇️

Crowd counting, which towards to accurately count the number of the objects in images, has been attracted more and more attention by researchers recently. However, challenges from severely occlusion, large scale variation, complex background interference and non-uniform density distribution, limit the crowd number estimation accuracy. To mitigate above issues, this paper proposes a novel crowd counting approach based on pyramidal scale module (PSM) and global context module (GCM), dubbed PSCNet. Moreover, a reliable supervision manner combined Bayesian and counting loss (BCL) is utilized to learn the density probability and then computes the count exception at each annotation point. Specifically, PSM is used to adaptively capture multi-scale information, which can identify a fine boundary of crowds with different image scales. GCM is devised with low-complexity and lightweight manner, to make the interactive information across the channels of the feature maps more efficient, meanwhile guide the model to select more suitable scales generated from PSM. Furthermore, BL is leveraged to construct a credible density contribution probability supervision manner, which relieves non-uniform density distribution in crowds to a certain extent. Extensive experiments on four crowd counting datasets show the effectiveness and superiority of the proposed model. Additionally, some experiments extended on a remote sensing object counting (RSOC) dataset further validate the generalization ability of the model. Our resource code will be released upon the acceptance of this work.

30.Learned Block Iterative Shrinkage Thresholding Algorithm for Photothermal Super Resolution Imaging ⬇️

Block-sparse regularization is already well-known in active thermal imaging and is used for multiple measurement based inverse problems. The main bottleneck of this method is the choice of regularization parameters which differs for each experiment. To avoid time-consuming manually selected regularization parameter, we propose a learned block-sparse optimization approach using an iterative algorithm unfolded into a deep neural network. More precisely, we show the benefits of using a learned block iterative shrinkage thresholding algorithm that is able to learn the choice of regularization parameters. In addition, this algorithm enables the determination of a suitable weight matrix to solve the underlying inverse problem. Therefore, in this paper we present the algorithm and compare it with state of the art block iterative shrinkage thresholding using synthetically generated test data and experimental test data from active thermography for defect reconstruction. Our results show that the use of the learned block-sparse optimization approach provides smaller normalized mean square errors for a small fixed number of iterations than without learning. Thus, this new approach allows to improve the convergence speed and only needs a few iterations to generate accurate defect reconstruction in photothermal super resolution imaging.

31.End-to-End Object Detection with Fully Convolutional Network ⬇️

Mainstream object detectors based on the fully convolutional network has achieved impressive performance. While most of them still need a hand-designed non-maximum suppression (NMS) post-processing, which impedes fully end-to-end training. In this paper, we give the analysis of discarding NMS, where the results reveal that a proper label assignment plays a crucial role. To this end, for fully convolutional detectors, we introduce a Prediction-aware One-To-One (POTO) label assignment for classification to enable end-to-end detection, which obtains comparable performance with NMS. Besides, a simple 3D Max Filtering (3DMF) is proposed to utilize the multi-scale features and improve the discriminability of convolutions in the local region. With these techniques, our end-to-end framework achieves competitive performance against many state-of-the-art detectors with NMS on COCO and CrowdHuman datasets. The code is available at this https URL .

32.Fine-Grained Dynamic Head for Object Detection ⬇️

The Feature Pyramid Network (FPN) presents a remarkable approach to alleviate the scale variance in object representation by performing instance-level assignments. Nevertheless, this strategy ignores the distinct characteristics of different sub-regions in an instance. To this end, we propose a fine-grained dynamic head to conditionally select a pixel-level combination of FPN features from different scales for each instance, which further releases the ability of multi-scale feature representation. Moreover, we design a spatial gate with the new activation function to reduce computational complexity dramatically through spatially sparse convolutions. Extensive experiments demonstrate the effectiveness and efficiency of the proposed method on several state-of-the-art detection benchmarks. Code is available at this https URL.

33.A Singular Value Perspective on Model Robustness ⬇️

Convolutional Neural Networks (CNNs) have made significant progress on several computer vision benchmarks, but are fraught with numerous non-human biases such as vulnerability to adversarial samples. Their lack of explainability makes identification and rectification of these biases difficult, and understanding their generalization behavior remains an open problem. In this work we explore the relationship between the generalization behavior of CNNs and the Singular Value Decomposition (SVD) of images. We show that naturally trained and adversarially robust CNNs exploit highly different features for the same dataset. We demonstrate that these features can be disentangled by SVD for ImageNet and CIFAR-10 trained networks. Finally, we propose Rank Integrated Gradients (RIG), the first rank-based feature attribution method to understand the dependence of CNNs on image rank.

34.Fine-grained Angular Contrastive Learning with Coarse Labels ⬇️

Few-shot learning methods offer pre-training techniques optimized for easier later adaptation of the model to new classes (unseen during training) using one or a few examples. This adaptivity to unseen classes is especially important for many practical applications where the pre-trained label space cannot remain fixed for effective use and the model needs to be "specialized" to support new categories on the fly. One particularly interesting scenario, essentially overlooked by the few-shot literature, is Coarse-to-Fine Few-Shot (C2FS), where the training classes (e.g. animals) are of much `coarser granularity' than the target (test) classes (e.g. breeds). A very practical example of C2FS is when the target classes are sub-classes of the training classes. Intuitively, it is especially challenging as (both regular and few-shot) supervised pre-training tends to learn to ignore intra-class variability which is essential for separating sub-classes. In this paper, we introduce a novel 'Angular normalization' module that allows to effectively combine supervised and self-supervised contrastive pre-training to approach the proposed C2FS task, demonstrating significant gains in a broad study over multiple baselines and datasets. We hope that this work will help to pave the way for future research on this new, challenging, and very practical topic of C2FS classification.

35.Rethinking Learnable Tree Filter for Generic Feature Transform ⬇️

The Learnable Tree Filter presents a remarkable approach to model structure-preserving relations for semantic segmentation. Nevertheless, the intrinsic geometric constraint forces it to focus on the regions with close spatial distance, hindering the effective long-range interactions. To relax the geometric constraint, we give the analysis by reformulating it as a Markov Random Field and introduce a learnable unary term. Besides, we propose a learnable spanning tree algorithm to replace the original non-differentiable one, which further improves the flexibility and robustness. With the above improvements, our method can better capture long-range dependencies and preserve structural details with linear complexity, which is extended to several vision tasks for more generic feature transform. Extensive experiments on object detection/instance segmentation demonstrate the consistent improvements over the original version. For semantic segmentation, we achieve leading performance (82.1% mIoU) on the Cityscapes benchmark without bells-and-whistles. Code is available at this https URL.

36.Meta Ordinal Regression Forest For Learning with Unsure Lung Nodules ⬇️

Deep learning-based methods have achieved promising performance in early detection and classification of lung nodules, most of which discard unsure nodules and simply deal with a binary classification -- malignant vs benign. Recently, an unsure data model (UDM) was proposed to incorporate those unsure nodules by formulating this problem as an ordinal regression, showing better performance over traditional binary classification. To further explore the ordinal relationship for lung nodule classification, this paper proposes a meta ordinal regression forest (MORF), which improves upon the state-of-the-art ordinal regression method, deep ordinal regression forest (DORF), in three major ways. First, MORF can alleviate the biases of the predictions by making full use of deep features while DORF needs to fix the composition of decision trees before training. Second, MORF has a novel grouped feature selection (GFS) module to re-sample the split nodes of decision trees. Last, combined with GFS, MORF is equipped with a meta learning-based weighting scheme to map the features selected by GFS to tree-wise weights while DORF assigns equal weights for all trees. Experimental results on the LIDC-IDRI dataset demonstrate superior performance over existing methods, including the state-of-the-art DORF.

37.Attention-based Saliency Hashing for Ophthalmic Image Retrieval ⬇️

Deep hashing methods have been proved to be effective for the large-scale medical image search assisting reference-based diagnosis for clinicians. However, when the salient region plays a maximal discriminative role in ophthalmic image, existing deep hashing methods do not fully exploit the learning ability of the deep network to capture the features of salient regions pointedly. The different grades or classes of ophthalmic images may be share similar overall performance but have subtle differences that can be differentiated by mining salient regions. To address this issue, we propose a novel end-to-end network, named Attention-based Saliency Hashing (ASH), for learning compact hash-code to represent ophthalmic images. ASH embeds a spatial-attention module to focus more on the representation of salient regions and highlights their essential role in differentiating ophthalmic images. Benefiting from the spatial-attention module, the information of salient regions can be mapped into the hash-code for similarity calculation. In the training stage, we input the image pairs to share the weights of the network, and a pairwise loss is designed to maximize the discriminability of the hash-code. In the retrieval stage, ASH obtains the hash-code by inputting an image with an end-to-end manner, then the hash-code is used to similarity calculation to return the most similar images. Extensive experiments on two different modalities of ophthalmic image datasets demonstrate that the proposed ASH can further improve the retrieval performance compared to the state-of-the-art deep hashing methods due to the huge contributions of the spatial-attention module.

38.PFA-GAN: Progressive Face Aging with Generative Adversarial Network ⬇️

Face aging is to render a given face to predict its future appearance, which plays an important role in the information forensics and security field as the appearance of the face typically varies with age. Although impressive results have been achieved with conditional generative adversarial networks (cGANs), the existing cGANs-based methods typically use a single network to learn various aging effects between any two different age groups. However, they cannot simultaneously meet three essential requirements of face aging -- including image quality, aging accuracy, and identity preservation -- and usually generate aged faces with strong ghost artifacts when the age gap becomes large. Inspired by the fact that faces gradually age over time, this paper proposes a novel progressive face aging framework based on generative adversarial network (PFA-GAN) to mitigate these issues. Unlike the existing cGANs-based methods, the proposed framework contains several sub-networks to mimic the face aging process from young to old, each of which only learns some specific aging effects between two adjacent age groups. The proposed framework can be trained in an end-to-end manner to eliminate accumulative artifacts and blurriness. Moreover, this paper introduces an age estimation loss to take into account the age distribution for an improved aging accuracy, and proposes to use the Pearson correlation coefficient as an evaluation metric measuring the aging smoothness for face aging methods. Extensively experimental results demonstrate superior performance over existing (c)GANs-based methods, including the state-of-the-art one, on two benchmarked datasets. The source code is available at~\url{this https URL}.

39.VideoMix: Rethinking Data Augmentation for Video Classification ⬇️

State-of-the-art video action classifiers often suffer from overfitting. They tend to be biased towards specific objects and scene cues, rather than the foreground action content, leading to sub-optimal generalization performances. Recent data augmentation strategies have been reported to address the overfitting problems in static image classifiers. Despite the effectiveness on the static image classifiers, data augmentation has rarely been studied for videos. For the first time in the field, we systematically analyze the efficacy of various data augmentation strategies on the video classification task. We then propose a powerful augmentation strategy VideoMix. VideoMix creates a new training video by inserting a video cuboid into another video. The ground truth labels are mixed proportionally to the number of voxels from each video. We show that VideoMix lets a model learn beyond the object and scene biases and extract more robust cues for action recognition. VideoMix consistently outperforms other augmentation baselines on Kinetics and the challenging Something-Something-V2 benchmarks. It also improves the weakly-supervised action localization performance on THUMOS'14. VideoMix pretrained models exhibit improved accuracies on the video detection task (AVA).

40.Hyperspectral Classification Based on Lightweight 3-D-CNN With Transfer Learning ⬇️

Recently, hyperspectral image (HSI) classification approaches based on deep learning (DL) models have been proposed and shown promising performance. However, because of very limited available training samples and massive model parameters, DL methods may suffer from overfitting. In this paper, we propose an end-to-end 3-D lightweight convolutional neural network (CNN) (abbreviated as 3-D-LWNet) for limited samples-based HSI classification. Compared with conventional 3-D-CNN models, the proposed 3-D-LWNet has a deeper network structure, less parameters, and lower computation cost, resulting in better classification performance. To further alleviate the small sample problem, we also propose two transfer learning strategies: 1) cross-sensor strategy, in which we pretrain a 3-D model in the source HSI data sets containing a greater number of labeled samples and then transfer it to the target HSI data sets and 2) cross-modal strategy, in which we pretrain a 3-D model in the 2-D RGB image data sets containing a large number of samples and then transfer it to the target HSI data sets. In contrast to previous approaches, we do not impose restrictions over the source data sets, in which they do not have to be collected by the same sensors as the target data sets. Experiments on three public HSI data sets captured by different sensors demonstrate that our model achieves competitive performance for HSI classification compared to several state-of-the-art methods

41.Selective Pseudo-Labeling with Reinforcement Learning for Semi-Supervised Domain Adaptation ⬇️

Recent domain adaptation methods have demonstrated impressive improvement on unsupervised domain adaptation problems. However, in the semi-supervised domain adaptation (SSDA) setting where the target domain has a few labeled instances available, these methods can fail to improve performance. Inspired by the effectiveness of pseudo-labels in domain adaptation, we propose a reinforcement learning based selective pseudo-labeling method for semi-supervised domain adaptation. It is difficult for conventional pseudo-labeling methods to balance the correctness and representativeness of pseudo-labeled data. To address this limitation, we develop a deep Q-learning model to select both accurate and representative pseudo-labeled instances. Moreover, motivated by large margin loss's capacity on learning discriminative features with little data, we further propose a novel target margin loss for our base model training to improve its discriminability. Our proposed method is evaluated on several benchmark datasets for SSDA, and demonstrates superior performance to all the comparison methods.

42.Interpreting Deep Neural Networks with Relative Sectional Propagation by Analyzing Comparative Gradients and Hostile Activations ⬇️

The clear transparency of Deep Neural Networks (DNNs) is hampered by complex internal structures and nonlinear transformations along deep hierarchies. In this paper, we propose a new attribution method, Relative Sectional Propagation (RSP), for fully decomposing the output predictions with the characteristics of class-discriminative attributions and clear objectness. We carefully revisit some shortcomings of backpropagation-based attribution methods, which are trade-off relations in decomposing DNNs. We define hostile factor as an element that interferes with finding the attributions of the target and propagate it in a distinguishable way to overcome the non-suppressed nature of activated neurons. As a result, it is possible to assign the bi-polar relevance scores of the target (positive) and hostile (negative) attributions while maintaining each attribution aligned with the importance. We also present the purging techniques to prevent the decrement of the gap between the relevance scores of the target and hostile attributions during backward propagation by eliminating the conflicting units to channel attribution map. Therefore, our method makes it possible to decompose the predictions of DNNs with clearer class-discriminativeness and detailed elucidations of activation neurons compared to the conventional attribution methods. In a verified experimental environment, we report the results of the assessments: (i) Pointing Game, (ii) mIoU, and (iii) Model Sensitivity with PASCAL VOC 2007, MS COCO 2014, and ImageNet datasets. The results demonstrate that our method outperforms existing backward decomposition methods, including distinctive and intuitive visualizations.

43.Boosting Image Super-Resolution Via Fusion of Complementary Information Captured by Multi-Modal Sensors ⬇️

Image Super-Resolution (SR) provides a promising technique to enhance the image quality of low-resolution optical sensors, facilitating better-performing target detection and autonomous navigation in a wide range of robotics applications. It is noted that the state-of-the-art SR methods are typically trained and tested using single-channel inputs, neglecting the fact that the cost of capturing high-resolution images in different spectral domains varies significantly. In this paper, we attempt to leverage complementary information from a low-cost channel (visible/depth) to boost image quality of an expensive channel (thermal) using fewer parameters. To this end, we first present an effective method to virtually generate pixel-wise aligned visible and thermal images based on real-time 3D reconstruction of multi-modal data captured at various viewpoints. Then, we design a feature-level multispectral fusion residual network model to perform high-accuracy SR of thermal images by adaptively integrating co-occurrence features presented in multispectral images. Experimental results demonstrate that this new approach can effectively alleviate the ill-posed inverse problem of image SR by taking into account complementary information from an additional low-cost channel, significantly outperforming state-of-the-art SR approaches in terms of both accuracy and efficiency.

44.PMP-Net: Point Cloud Completion by Learning Multi-step Point Moving Paths ⬇️

The task of point cloud completion aims to predict the missing part for an incomplete 3D shape. A widely used strategy is to generate a complete point cloud from the incomplete one. However, the unordered nature of point clouds will degrade the generation of high-quality 3D shapes, as the detailed topology and structure of discrete points are hard to be captured by the generative process only using a latent code. In this paper, we address the above problem by reconsidering the completion task from a new perspective, where we formulate the prediction as a point cloud deformation process. Specifically, we design a novel neural network, named PMP-Net, to mimic the behavior of an earth mover. It moves move each point of the incomplete input to complete the point cloud, where the total distance of point moving paths (PMP) should be shortest. Therefore, PMP-Net predicts a unique point moving path for each point according to the constraint of total point moving distances. As a result, the network learns a strict and unique correspondence on point-level, which can capture the detailed topology and structure relationships between the incomplete shape and the complete target, and thus improves the quality of the predicted complete shape. We conduct comprehensive experiments on Completion3D and PCN datasets, which demonstrate our advantages over the state-of-the-art point cloud completion methods.

45.CompFeat: Comprehensive Feature Aggregation for Video Instance Segmentation ⬇️

Video instance segmentation is a complex task in which we need to detect, segment, and track each object for any given video. Previous approaches only utilize single-frame features for the detection, segmentation, and tracking of objects and they suffer in the video scenario due to several distinct challenges such as motion blur and drastic appearance change. To eliminate ambiguities introduced by only using single-frame features, we propose a novel comprehensive feature aggregation approach (CompFeat) to refine features at both frame-level and object-level with temporal and spatial context information. The aggregation process is carefully designed with a new attention mechanism which significantly increases the discriminative power of the learned features. We further improve the tracking capability of our model through a siamese design by incorporating both feature similarities and spatial similarities. Experiments conducted on the YouTube-VIS dataset validate the effectiveness of proposed CompFeat. Our code will be available at this https URL.

46.Art Style Classification with Self-Trained Ensemble of AutoEncoding Transformations ⬇️

The artistic style of a painting is a rich descriptor that reveals both visual and deep intrinsic knowledge about how an artist uniquely portrays and expresses their creative vision. Accurate categorization of paintings across different artistic movements and styles is critical for large-scale indexing of art databases. However, the automatic extraction and recognition of these highly dense artistic features has received little to no attention in the field of computer vision research. In this paper, we investigate the use of deep self-supervised learning methods to solve the problem of recognizing complex artistic styles with high intra-class and low inter-class variation. Further, we outperform existing approaches by almost 20% on a highly class imbalanced WikiArt dataset with 27 art categories. To achieve this, we train the EnAET semi-supervised learning model (Wang et al., 2019) with limited annotated data samples and supplement it with self-supervised representations learned from an ensemble of spatial and non-spatial transformations.

47.Proactive Pseudo-Intervention: Causally Informed Contrastive Learning For Interpretable Vision Models ⬇️

Deep neural networks have shown significant promise in comprehending complex visual signals, delivering performance on par or even superior to that of human experts. However, these models often lack a mechanism for interpreting their predictions, and in some cases, particularly when the sample size is small, existing deep learning solutions tend to capture spurious correlations that compromise model generalizability on unseen inputs. In this work, we propose a contrastive causal representation learning strategy that leverages proactive interventions to identify causally-relevant image features, called Proactive Pseudo-Intervention (PPI). This approach is complemented with a causal salience map visualization module, i.e., Weight Back Propagation (WBP), that identifies important pixels in the raw input image, which greatly facilitates the interpretability of predictions. To validate its utility, our model is benchmarked extensively on both standard natural images and challenging medical image datasets. We show this new contrastive causal representation learning model consistently improves model performance relative to competing solutions, particularly for out-of-domain predictions or when dealing with data integration from heterogeneous sources. Further, our causal saliency maps are more succinct and meaningful relative to their non-causal counterparts.

48.Visual Aware Hierarchy Based Food Recognition ⬇️

Food recognition is one of the most important components in image-based dietary assessment. However, due to the different complexity level of food images and inter-class similarity of food categories, it is challenging for an image-based food recognition system to achieve high accuracy for a variety of publicly available datasets. In this work, we propose a new two-step food recognition system that includes food localization and hierarchical food classification using Convolutional Neural Networks (CNNs) as the backbone architecture. The food localization step is based on an implementation of the Faster R-CNN method to identify food regions. In the food classification step, visually similar food categories can be clustered together automatically to generate a hierarchical structure that represents the semantic visual relations among food categories, then a multi-task CNN model is proposed to perform the classification task based on the visual aware hierarchical structure. Since the size and quality of dataset is a key component of data driven methods, we introduce a new food image dataset, VIPER-FoodNet (VFN) dataset, consists of 82 food categories with 15k images based on the most commonly consumed foods in the United States. A semi-automatic crowdsourcing tool is used to provide the ground-truth information for this dataset including food object bounding boxes and food object labels. Experimental results demonstrate that our system can significantly improve both classification and recognition performance on 4 publicly available datasets and the new VFN dataset.

49.Self-Training for Class-Incremental Semantic Segmentation ⬇️

We study incremental learning for semantic segmentation where when learning new classes we have no access to the labeled data of previous tasks. When incrementally learning new classes, deep neural networks suffer from catastrophic forgetting of previous learned knowledge. To address this problem, we propose to apply a self-training approach that leverages unlabeled data, which is used for rehearsal of previous knowledge. Additionally, conflict reduction is proposed to resolve the conflicts of pseudo labels generated from both the old and new models. We show that maximizing self-entropy can further improve results by smoothing the overconfident predictions. The experiments demonstrate state-of-the-art results: obtaining a relative gain of up to 114% on Pascal-VOC 2012 and 8.5% on the more challenging ADE20K compared to previous state-of-the-art methods.

50.Select, Label, and Mix: Learning Discriminative Invariant Feature Representations for Partial Domain Adaptation ⬇️

Partial domain adaptation which assumes that the unknown target label space is a subset of the source label space has attracted much attention in computer vision. Despite recent progress, existing methods often suffer from three key problems: negative transfer, lack of discriminability and domain invariance in the latent space. To alleviate the above issues, we develop a novel 'Select, Label, and Mix' (SLM) framework that aims to learn discriminative invariant feature representations for partial domain adaptation. First, we present a simple yet efficient "select" module that automatically filters out the outlier source samples to avoid negative transfer while aligning distributions across both domains. Second, the "label" module iteratively trains the classifier using both the labeled source domain data and the generated pseudo-labels for the target domain to enhance the discriminability of the latent space. Finally, the "mix" module utilizes domain mixup regularization jointly with the other two modules to explore more intrinsic structures across domains leading to a domain-invariant latent space for partial domain adaptation. Extensive experiments on several benchmark datasets demonstrate the superiority of our proposed framework over state-of-the-art methods.

51.Rethinking FUN: Frequency-Domain Utilization Networks ⬇️

The search for efficient neural network architectures has gained much focus in recent years, where modern architectures focus not only on accuracy but also on inference time and model size. Here, we present FUN, a family of novel Frequency-domain Utilization Networks. These networks utilize the inherent efficiency of the frequency-domain by working directly in that domain, represented with the Discrete Cosine Transform. Using modern techniques and building blocks such as compound-scaling and inverted-residual layers we generate a set of such networks allowing one to balance between size, latency and accuracy while outperforming competing RGB-based models. Extensive evaluations verifies that our networks present strong alternatives to previous approaches. Moreover, we show that working in frequency domain allows for dynamic compression of the input at inference time without any explicit change to the architecture.

52.Improving Auto-Encoders' self-supervised image classification using pseudo-labelling via data augmentation and the perceptual loss ⬇️

In this paper, we introduce a novel method to pseudo-label unlabelled images and train an Auto-Encoder to classify them in a self-supervised manner that allows for a high accuracy and consistency across several datasets. The proposed method consists of first applying a randomly sampled set of data augmentation transformations to each training image. As a result, each initial image can be considered as a pseudo-label to its corresponding augmented ones. Then, an Auto-Encoder is used to learn the mapping between each set of the augmented images and its corresponding pseudo-label. Furthermore, the perceptual loss is employed to take into consideration the existing dependencies between the pixels in the same neighbourhood of an image. This combination encourages the encoder to output richer encodings that are highly informative of the input's class. Consequently, the Auto-Encoder's performance on unsupervised image classification is improved both in termes of stability and accuracy becoming more uniform and more consistent across all tested datasets. Previous state-of-the-art accuracy on the MNIST, CIFAR-10 and SVHN datasets is improved by 0.3%, 3.11% and 9.21% respectively.

53.Efficient Human Pose Estimation with Depthwise Separable Convolution and Person Centroid Guided Joint Grouping ⬇️

In this paper, we propose efficient and effective methods for 2D human pose estimation. A new ResBlock is proposed based on depthwise separable convolution and is utilized instead of the original one in Hourglass network. It can be further enhanced by replacing the vanilla depthwise convolution with a mixed depthwise convolution. Based on it, we propose a bottom-up multi-person pose estimation method. A rooted tree is used to represent human pose by introducing person centroid as the root which connects to all body joints directly or hierarchically. Two branches of sub-networks are used to predict the centroids, body joints and their offsets to their parent nodes. Joints are grouped by tracing along their offsets to the closest centroids. Experimental results on the MPII human dataset and the LSP dataset show that both our single-person and multi-person pose estimation methods can achieve competitive accuracies with low computational costs.

54.TediGAN: Text-Guided Diverse Image Generation and Manipulation ⬇️

In this work, we propose TediGAN, a novel framework for multi-modal image generation and manipulation with textual descriptions. The proposed method consists of three components: StyleGAN inversion module, visual-linguistic similarity learning, and instance-level optimization. The inversion module is to train an image encoder to map real images to the latent space of a well-trained StyleGAN. The visual-linguistic similarity is to learn the text-image matching by mapping the image and text into a common embedding space. The instance-level optimization is for identity preservation in manipulation. Our model can provide the lowest effect guarantee, and produce diverse and high-quality images with an unprecedented resolution at 1024. Using a control mechanism based on style-mixing, our TediGAN inherently supports image synthesis with multi-modal inputs, such as sketches or semantic labels with or without instance (text or real image) guidance. To facilitate text-guided multi-modal synthesis, we propose the Multi-Modal CelebA-HQ, a large-scale dataset consisting of real face images and corresponding semantic segmentation map, sketch, and textual descriptions. Extensive experiments on the introduced dataset demonstrate the superior performance of our proposed method. Code and data are available at this https URL.

55.Pedestrian Behavior Prediction via Multitask Learning and Categorical Interaction Modeling ⬇️

Pedestrian behavior prediction is one of the major challenges for intelligent driving systems. Pedestrians often exhibit complex behaviors influenced by various contextual elements. To address this problem, we propose a multitask learning framework that simultaneously predicts trajectories and actions of pedestrians by relying on multimodal data. Our method benefits from 1) a hybrid mechanism to encode different input modalities independently allowing them to develop their own representations, and jointly to produce a representation for all modalities using shared parameters; 2) a novel interaction modeling technique that relies on categorical semantic parsing of the scenes to capture interactions between target pedestrians and their surroundings; and 3) a dual prediction mechanism that uses both independent and shared decoding of multimodal representations. Using public pedestrian behavior benchmark datasets for driving, PIE and JAAD, we highlight the benefits of multitask learning for behavior prediction and show that our model achieves state-of-the-art performance and improves trajectory and action prediction by up to 22% and 6% respectively. We further investigate the contributions of the proposed processing and interaction modeling techniques via extensive ablation studies.

56.Towards Better Object Detection in Scale Variation with Adaptive Feature Selection ⬇️

It is a common practice to exploit pyramidal feature representation to tackle the problem of scale variation in object instances. However, most of them still predict the objects in a certain range of scales based solely or mainly on a single-level representation, yielding inferior detection performance. To this end, we propose a novel adaptive feature selection module (AFSM), to automatically learn the way to fuse multi-level representations in the channel dimension, in a data-driven manner. It significantly improves the performance of the detectors that have a feature pyramid structure, while introducing nearly free inference overhead. Moreover, we propose a class-aware sampling mechanism to tackle the class imbalance problem, which automatically assigns the sampling weight to each of the images during training, according to the number of objects in each class. Experimental results demonstrate the effectiveness of the proposed method, with 83.04% mAP at 15.96 FPS on the VOC dataset, and 39.48% AP on the VisDrone-DET validation subset, respectively, outperforming other state-of-the-art detectors considerably. The code will be available soon at this https URL.

57.Cross-Layer Distillation with Semantic Calibration ⬇️

Recently proposed knowledge distillation approaches based on feature-map transfer validate that intermediate layers of a teacher model can serve as effective targets for training a student model to obtain better generalization ability. Existing studies mainly focus on particular representation forms for knowledge transfer between manually specified pairs of teacher-student intermediate layers. However, semantics of intermediate layers may vary in different networks and manual association of layers might lead to negative regularization caused by semantic mismatch between certain teacher-student layer pairs. To address this problem, we propose Semantic Calibration for Cross-layer Knowledge Distillation (SemCKD), which automatically assigns proper target layers of the teacher model for each student layer with an attention mechanism. With a learned attention distribution, each student layer distills knowledge contained in multiple layers rather than a single fixed intermediate layer from the teacher model for appropriate cross-layer supervision in training. Consistent improvements over state-of-the-art approaches are observed in extensive experiments with various network architectures for teacher and student models, demonstrating the effectiveness and flexibility of the proposed attention based soft layer association mechanism for cross-layer distillation.

58.Spatiotemporal tomography based on scattered multiangular signals and its application for resolving evolving clouds using moving platforms ⬇️

We derive computed tomography (CT) of a time-varying volumetric translucent object, using a small number of moving cameras. We particularly focus on passive scattering tomography, which is a non-linear problem. We demonstrate the approach on dynamic clouds, as clouds have a major effect on Earth's climate. State of the art scattering CT assumes a static object. Existing 4D CT methods rely on a linear image formation model and often on significant priors. In this paper, the angular and temporal sampling rates needed for a proper recovery are discussed. If these rates are used, the paper leads to a representation of the time-varying object, which simplifies 4D CT tomography. The task is achieved using gradient-based optimization. We demonstrate this in physics-based simulations and in an experiment that had yielded real-world data.

59.Skeleon-Based Typing Style Learning For Person Identification ⬇️

We present a novel architecture for person identification based on typing-style, constructed of adaptive non-local spatio-temporal graph convolutional network. Since type style dynamics convey meaningful information that can be useful for person identification, we extract the joints positions and then learn their movements' dynamics. Our non-local approach increases our model's robustness to noisy input data while analyzing joints locations instead of RGB data provides remarkable robustness to alternating environmental conditions, e.g., lighting, noise, etc. We further present two new datasets for typing style based person identification task and extensive evaluation that displays our model's superior discriminative and generalization abilities, when compared with state-of-the-art skeleton-based models.

60.MVHM: A Large-Scale Multi-View Hand Mesh Benchmark for Accurate 3D Hand Pose Estimation ⬇️

Estimating 3D hand poses from a single RGB image is challenging because depth ambiguity leads the problem ill-posed. Training hand pose estimators with 3D hand mesh annotations and multi-view images often results in significant performance gains. However, existing multi-view datasets are relatively small with hand joints annotated by off-the-shelf trackers or automated through model predictions, both of which may be inaccurate and can introduce biases. Collecting a large-scale multi-view 3D hand pose images with accurate mesh and joint annotations is valuable but strenuous. In this paper, we design a spin match algorithm that enables a rigid mesh model matching with any target mesh ground truth. Based on the match algorithm, we propose an efficient pipeline to generate a large-scale multi-view hand mesh (MVHM) dataset with accurate 3D hand mesh and joint labels. We further present a multi-view hand pose estimation approach to verify that training a hand pose estimator with our generated dataset greatly enhances the performance. Experimental results show that our approach achieves the performance of 0.990 in $\text{AUC}_{\text{20-50}}$ on the MHP dataset compared to the previous state-of-the-art of 0.939 on this dataset. Our datasset is public available. \footnote{\url{this https URL}} Our datasset is available at~\href{this https URL}{\color{blue}{this https URL}}.

61.Temporal-Aware Self-Supervised Learning for 3D Hand Pose and Mesh Estimation in Videos ⬇️

Estimating 3D hand pose directly from RGB imagesis challenging but has gained steady progress recently bytraining deep models with annotated 3D poses. Howeverannotating 3D poses is difficult and as such only a few 3Dhand pose datasets are available, all with limited samplesizes. In this study, we propose a new framework of training3D pose estimation models from RGB images without usingexplicit 3D annotations, i.e., trained with only 2D informa-tion. Our framework is motivated by two observations: 1)Videos provide richer information for estimating 3D posesas opposed to static images; 2) Estimated 3D poses oughtto be consistent whether the videos are viewed in the for-ward order or reverse order. We leverage these two obser-vations to develop a self-supervised learning model calledtemporal-aware self-supervised network (TASSN). By en-forcing temporal consistency constraints, TASSN learns 3Dhand poses and meshes from videos with only 2D keypointposition annotations. Experiments show that our modelachieves surprisingly good results, with 3D estimation ac-curacy on par with the state-of-the-art models trained with3D annotations, highlighting the benefit of the temporalconsistency in constraining 3D prediction models.

62.DGGAN: Depth-image Guided Generative Adversarial Networks for Disentangling RGB and Depth Images in 3D Hand Pose Estimation ⬇️

Estimating3D hand poses from RGB images is essentialto a wide range of potential applications, but is challengingowing to substantial ambiguity in the inference of depth in-formation from RGB images. State-of-the-art estimators ad-dress this problem by regularizing3D hand pose estimationmodels during training to enforce the consistency betweenthe predicted3D poses and the ground-truth depth maps.However, these estimators rely on both RGB images and thepaired depth maps during training. In this study, we proposea conditional generative adversarial network (GAN) model,called Depth-image Guided GAN (DGGAN), to generate re-alistic depth maps conditioned on the input RGB image, anduse the synthesized depth maps to regularize the3D handpose estimation model, therefore eliminating the need forground-truth depth maps. Experimental results on multiplebenchmark datasets show that the synthesized depth mapsproduced by DGGAN are quite effective in regularizing thepose estimation model, yielding new state-of-the-art resultsin estimation accuracy, notably reducing the mean3D end-point errors (EPE) by4.7%,16.5%, and6.8%on the RHD,STB and MHP datasets, respectively.

63.Online Adaptation for Consistent Mesh Reconstruction in the Wild ⬇️

This paper presents an algorithm to reconstruct temporally consistent 3D meshes of deformable object instances from videos in the wild. Without requiring annotations of 3D mesh, 2D keypoints, or camera pose for each video frame, we pose video-based reconstruction as a self-supervised online adaptation problem applied to any incoming test video. We first learn a category-specific 3D reconstruction model from a collection of single-view images of the same category that jointly predicts the shape, texture, and camera pose of an image. Then, at inference time, we adapt the model to a test video over time using self-supervised regularization terms that exploit temporal consistency of an object instance to enforce that all reconstructed meshes share a common texture map, a base shape, as well as parts. We demonstrate that our algorithm recovers temporally consistent and reliable 3D structures from videos of non-rigid objects including those of animals captured in the wild -- an extremely challenging task rarely addressed before.

64.Depth Completion using Piecewise Planar Model ⬇️

A depth map can be represented by a set of learned bases and can be efficiently solved in a closed form solution. However, one issue with this method is that it may create artifacts when colour boundaries are inconsistent with depth boundaries. In fact, this is very common in a natural image. To address this issue, we enforce a more strict model in depth recovery: a piece-wise planar model. More specifically, we represent the desired depth map as a collection of 3D planar and the reconstruction problem is formulated as the optimization of planar parameters. Such a problem can be formulated as a continuous CRF optimization problem and can be solved through particle based method (MP-PBP) \cite{Yamaguchi14}. Extensive experimental evaluations on the KITTI visual odometry dataset show that our proposed methods own high resistance to false object boundaries and can generate useful and visually pleasant 3D point clouds.

65.Computer Stereo Vision for Autonomous Driving ⬇️

For a long time, autonomous cars were found only in science fiction movies and series but now they are becoming a reality. The opportunity to pick such a vehicle at your garage forecourt is closer than you think. As an important component of autonomous systems, autonomous car perception has had a big leap with recent advances in parallel computing architectures, such as OpenMP for multi-threading CPUs and OpenCL for GPUs. With the use of tiny but full-feature embedded supercomputers, computer stereo vision has been prevalently applied in autonomous cars for depth perception. The two key aspects of computer stereo vision are speed and accuracy. They are two desirable but conflicting properties -- the algorithms with better disparity accuracy usually have higher computational complexity. Therefore, the main aim of developing a computer stereo vision algorithm for resource-limited hardware is to improve the trade-off between speed and accuracy. In this chapter, we first introduce the autonomous car system, from the hardware aspect to the software aspect. We then discuss four autonomous car perception functionalities, including: 1) visual feature detection, description and matching, 2) 3D information acquisition, 3) object detection/recognition and 4) semantic image segmentation. Finally, we introduce the principles of computer stereo vision and parallel computing.

66.Maximum Entropy Subspace Clustering Network ⬇️

Deep subspace clustering network (DSC-Net) and its numerous variants have achieved impressive performance for subspace clustering, in which an auto-encoder non-linearly maps input data into a latent space, and a fully connected layer named self-expressiveness module is introduced between the encoder and the decoder to learn an affinity matrix. However, the adopted regularization on the affinity matrix (e.g., sparse, Tikhonov, or low-rank) is still insufficient to drive the learning of an ideal affinity matrix, thus limiting their performance. In addition, in DSC-Net, the self-expressiveness module and the auto-encoder module are tightly coupled, making the training of the DSC-Net non-trivial. To this end, in this paper, we propose a novel deep learning-based clustering method named Maximum Entropy Subspace Clustering Network (MESC-Net). Specifically, MESC-Net maximizes the learned affinity matrix's entropy to encourage it to exhibit an ideal affinity matrix structure. We theoretically prove that the affinity matrix driven by MESC-Net obeys the block-diagonal property, and experimentally show that its elements corresponding to the same subspace are uniformly and densely distributed, which gives better clustering performance. Moreover, we explicitly decouple the auto-encoder module and the self-expressiveness module. Extensive quantitative and qualitative results on commonly used benchmark datasets validate MESC-Net significantly outperforms state-of-the-art methods.

67.Food Classification with Convolutional Neural Networks and Multi-Class Linear Discernment Analysis ⬇️

Convolutional neural networks (CNNs) have been successful in representing the fully-connected inferencing ability perceived to be seen in the human brain: they take full advantage of the hierarchy-style patterns commonly seen in complex data and develop more patterns using simple features. Countless implementations of CNNs have shown how strong their ability is to learn these complex patterns, particularly in the realm of image classification. However, the cost of getting a high performance CNN to a so-called "state of the art" level is computationally costly. Even when using transfer learning, which utilize the very deep layers from models such as MobileNetV2, CNNs still take a great amount of time and resources. Linear discriminant analysis (LDA), a generalization of Fisher's linear discriminant, can be implemented in a multi-class classification method to increase separability of class features while not needing a high performance system to do so for image classification. Similarly, we also believe LDA has great promise in performing well. In this paper, we discuss our process of developing a robust CNN for food classification as well as our effective implementation of multi-class LDA and prove that (1) CNN is superior to LDA for image classification and (2) why LDA should not be left out of the races for image classification, particularly for binary cases.

68.Any-Width Networks ⬇️

Despite remarkable improvements in speed and accuracy, convolutional neural networks (CNNs) still typically operate as monolithic entities at inference time. This poses a challenge for resource-constrained practical applications, where both computational budgets and performance needs can vary with the situation. To address these constraints, we propose the Any-Width Network (AWN), an adjustable-width CNN architecture and associated training routine that allow for fine-grained control over speed and accuracy during inference. Our key innovation is the use of lower-triangular weight matrices which explicitly address width-varying batch statistics while being naturally suited for multi-width operations. We also show that this design facilitates an efficient training routine based on random width sampling. We empirically demonstrate that our proposed AWNs compare favorably to existing methods while providing maximally granular control during inference.

69.Automatic sampling and training method for wood-leaf classification based on tree terrestrial point cloud ⬇️

Terrestrial laser scanning technology provides an efficient and accuracy solution for acquiring three-dimensional information of plants. The leaf-wood classification of plant point cloud data is a fundamental step for some forestry and biological research. An automatic sampling and training method for classification was proposed based on tree point cloud data. The plane fitting method was used for selecting leaf sample points and wood sample points automatically, then two local features were calculated for training and classification by using support vector machine (SVM) algorithm. The point cloud data of ten trees were tested by using the proposed method and a manual selection method. The average correct classification rate and kappa coefficient are 0.9305 and 0.7904, respectively. The results show that the proposed method had better efficiency and accuracy comparing to the manual selection method.

70.YieldNet: A Convolutional Neural Network for Simultaneous Corn and Soybean Yield Prediction Based on Remote Sensing Data ⬇️

Large scale crop yield estimation is, in part, made possible due to the availability of remote sensing data allowing for the continuous monitoring of crops throughout its growth state. Having this information allows stakeholders the ability to make real-time decisions to maximize yield potential. Although various models exist that predict yield from remote sensing data, there currently does not exist an approach that can estimate yield for multiple crops simultaneously, and thus leads to more accurate predictions. A model that predicts yield of multiple crops and concurrently considers the interaction between multiple crop's yield. We propose a new model called YieldNet which utilizes a novel deep learning framework that uses transfer learning between corn and soybean yield predictions by sharing the weights of the backbone feature extractor. Additionally, to consider the multi-target response variable, we propose a new loss function. Numerical results demonstrate that our proposed method accurately predicts yield from one to four months before the harvest, and is competitive to other state-of-the-art approaches.

71.It's All Around You: Range-Guided Cylindrical Network for 3D Object Detection ⬇️

Modern perception systems in the field of autonomous driving rely on 3D data analysis. LiDAR sensors are frequently used to acquire such data due to their increased resilience to different lighting conditions. Although rotating LiDAR scanners produce ring-shaped patterns in space, most networks analyze their data using an orthogonal voxel sampling strategy. This work presents a novel approach for analyzing 3D data produced by 360-degree depth scanners, utilizing a more suitable coordinate system, which is aligned with the scanning pattern. Furthermore, we introduce a novel notion of range-guided convolutions, adapting the receptive field by distance from the ego vehicle and the object's scale. Our network demonstrates powerful results on the nuScenes challenge, comparable to current state-of-the-art architectures. The backbone architecture introduced in this work can be easily integrated onto other pipelines as well.

72.LandCoverNet: A global benchmark land cover classification training dataset ⬇️

Regularly updated and accurate land cover maps are essential for monitoring 14 of the 17 Sustainable Development Goals. Multispectral satellite imagery provide high-quality and valuable information at global scale that can be used to develop land cover classification models. However, such a global application requires a geographically diverse training dataset. Here, we present LandCoverNet, a global training dataset for land cover classification based on Sentinel-2 observations at 10m spatial resolution. Land cover class labels are defined based on annual time-series of Sentinel-2, and verified by consensus among three human annotators.

73.Spectral Distribution aware Image Generation ⬇️

Recent advances in deep generative models for photo-realistic images have led to high quality visual results. Such models learn to generate data from a given training distribution such that generated images can not be easily distinguished from real images by the human eye. Yet, recent work on the detection of such fake images pointed out that they are actually easily distinguishable by artifacts in their frequency spectra. In this paper, we propose to generate images according to the frequency distribution of the real data by employing a spectral discriminator. The proposed discriminator is lightweight, modular and works stably with different commonly used GAN losses. We show that the resulting models can better generate images with realistic frequency spectra, which are thus harder to detect by this cue.

74.Generating Synthetic Multispectral Satellite Imagery from Sentinel-2 ⬇️

Multi-spectral satellite imagery provides valuable data at global scale for many environmental and socio-economic applications. Building supervised machine learning models based on these imagery, however, may require ground reference labels which are not available at global scale. Here, we propose a generative model to produce multi-resolution multi-spectral imagery based on Sentinel-2 data. The resulting synthetic images are indistinguishable from real ones by humans. This technique paves the road for future work to generate labeled synthetic imagery that can be used for data augmentation in data scarce regions and applications.

75.Semantic Segmentation of Medium-Resolution Satellite Imagery using Conditional Generative Adversarial Networks ⬇️

Semantic segmentation of satellite imagery is a common approach to identify patterns and detect changes around the planet. Most of the state-of-the-art semantic segmentation models are trained in a fully supervised way using Convolutional Neural Network (CNN). The generalization property of CNN is poor for satellite imagery because the data can be very diverse in terms of landscape types, image resolutions, and scarcity of labels for different geographies and seasons. Hence, the performance of CNN doesn't translate well to images from unseen regions or seasons. Inspired by Conditional Generative Adversarial Networks (CGAN) based approach of image-to-image translation for high-resolution satellite imagery, we propose a CGAN framework for land cover classification using medium-resolution Sentinel-2 imagery. We find that the CGAN model outperforms the CNN model of similar complexity by a significant margin on an unseen imbalanced test dataset.

76.MyFood: A Food Segmentation and Classification System to Aid Nutritional Monitoring ⬇️

The absence of food monitoring has contributed significantly to the increase in the population's weight. Due to the lack of time and busy routines, most people do not control and record what is consumed in their diet. Some solutions have been proposed in computer vision to recognize food images, but few are specialized in nutritional monitoring. This work presents the development of an intelligent system that classifies and segments food presented in images to help the automatic monitoring of user diet and nutritional intake. This work shows a comparative study of state-of-the-art methods for image classification and segmentation, applied to food recognition. In our methodology, we compare the FCN, ENet, SegNet, DeepLabV3+, and Mask RCNN algorithms. We build a dataset composed of the most consumed Brazilian food types, containing nine classes and a total of 1250 images. The models were evaluated using the following metrics: Intersection over Union, Sensitivity, Specificity, Balanced Precision, and Positive Predefined Value. We also propose an system integrated into a mobile application that automatically recognizes and estimates the nutrients in a meal, assisting people with better nutritional monitoring. The proposed solution showed better results than the existing ones in the market. The dataset is publicly available at the following link this http URL

77.Dynamic Neural Radiance Fields for Monocular 4D Facial Avatar Reconstruction ⬇️

We present dynamic neural radiance fields for modeling the appearance and dynamics of a human face. Digitally modeling and reconstructing a talking human is a key building-block for a variety of applications. Especially, for telepresence applications in AR or VR, a faithful reproduction of the appearance including novel viewpoints or head-poses is required. In contrast to state-of-the-art approaches that model the geometry and material properties explicitly, or are purely image-based, we introduce an implicit representation of the head based on scene representation networks. To handle the dynamics of the face, we combine our scene representation network with a low-dimensional morphable model which provides explicit control over pose and expressions. We use volumetric rendering to generate images from this hybrid representation and demonstrate that such a dynamic neural scene representation can be learned from monocular input data only, without the need of a specialized capture setup. In our experiments, we show that this learned volumetric representation allows for photo-realistic image generation that surpasses the quality of state-of-the-art video-based reenactment methods.

78.Self-Supervised Visual Representation Learning from Hierarchical Grouping ⬇️

We create a framework for bootstrapping visual representation learning from a primitive visual grouping capability. We operationalize grouping via a contour detector that partitions an image into regions, followed by merging of those regions into a tree hierarchy. A small supervised dataset suffices for training this grouping primitive. Across a large unlabeled dataset, we apply this learned primitive to automatically predict hierarchical region structure. These predictions serve as guidance for self-supervised contrastive feature learning: we task a deep network with producing per-pixel embeddings whose pairwise distances respect the region hierarchy. Experiments demonstrate that our approach can serve as state-of-the-art generic pre-training, benefiting downstream tasks. We additionally explore applications to semantic region search and video-based object instance tracking.

79.Understanding Bird's-Eye View Semantic HD-Maps Using an Onboard Monocular Camera ⬇️

Autonomous navigation requires scene understanding of the action-space to move or anticipate events. For planner agents moving on the ground plane, such as autonomous vehicles, this translates to scene understanding in the bird's-eye view. However, the onboard cameras of autonomous cars are customarily mounted horizontally for a better view of the surrounding. In this work, we study scene understanding in the form of online estimation of semantic bird's-eye-view HD-maps using the video input from a single onboard camera. We study three key aspects of this task, image-level understanding, BEV level understanding, and the aggregation of temporal information. Based on these three pillars we propose a novel architecture that combines these three aspects. In our extensive experiments, we demonstrate that the considered aspects are complementary to each other for HD-map understanding. Furthermore, the proposed architecture significantly surpasses the current state-of-the-art.

80.ParaNet: Deep Regular Representation for 3D Point Clouds ⬇️

Although convolutional neural networks have achieved remarkable success in analyzing 2D images/videos, it is still non-trivial to apply the well-developed 2D techniques in regular domains to the irregular 3D point cloud data. To bridge this gap, we propose ParaNet, a novel end-to-end deep learning framework, for representing 3D point clouds in a completely regular and nearly lossless manner. To be specific, ParaNet converts an irregular 3D point cloud into a regular 2D color image, named point geometry image (PGI), where each pixel encodes the spatial coordinates of a point. In contrast to conventional regular representation modalities based on multi-view projection and voxelization, the proposed representation is differentiable and reversible. Technically, ParaNet is composed of a surface embedding module, which parameterizes 3D surface points onto a unit square, and a grid resampling module, which resamples the embedded 2D manifold over regular dense grids. Note that ParaNet is unsupervised, i.e., the training simply relies on reference-free geometry constraints. The PGIs can be seamlessly coupled with a task network established upon standard and mature techniques for 2D images/videos to realize a specific task for 3D point clouds. We evaluate ParaNet over shape classification and point cloud upsampling, in which our solutions perform favorably against the existing state-of-the-art methods. We believe such a paradigm will open up many possibilities to advance the progress of deep learning-based point cloud processing and understanding.

81.Depth estimation from 4D light field videos ⬇️

Depth (disparity) estimation from 4D Light Field (LF) images has been a research topic for the last couple of years. Most studies have focused on depth estimation from static 4D LF images while not considering temporal information, i.e., LF videos. This paper proposes an end-to-end neural network architecture for depth estimation from 4D LF videos. This study also constructs a medium-scale synthetic 4D LF video dataset that can be used for training deep learning-based methods. Experimental results using synthetic and real-world 4D LF videos show that temporal information contributes to the improvement of depth estimation accuracy in noisy regions. Dataset and code is available at: this https URL

82.CIA-SSD: Confident IoU-Aware Single-Stage Object Detector From Point Cloud ⬇️

Existing single-stage detectors for locating objects in point clouds often treat object localization and category classification as separate tasks, so the localization accuracy and classification confidence may not well align. To address this issue, we present a new single-stage detector named the Confident IoU-Aware Single-Stage object Detector (CIA-SSD). First, we design the lightweight Spatial-Semantic Feature Aggregation module to adaptively fuse high-level abstract semantic features and low-level spatial features for accurate predictions of bounding boxes and classification confidence. Also, the predicted confidence is further rectified with our designed IoU-aware confidence rectification module to make the confidence more consistent with the localization accuracy. Based on the rectified confidence, we further formulate the Distance-variant IoU-weighted NMS to obtain smoother regressions and avoid redundant predictions. We experiment CIA-SSD on 3D car detection in the KITTI test set and show that it attains top performance in terms of the official ranking metric (moderate AP 80.28%) and above 32 FPS inference speed, outperforming all prior single-stage detectors. The code is available at this https URL.

83.ProMask: Probability Mask for Skeleton Detection ⬇️

Detecting object skeletons in natural images presents challenging, due to varied object scales, the complexity of backgrounds and various noises. The skeleton is a highly compressing shape representation, which can bring some essential advantages but cause the difficulties of detection. This skeleton line occupies a rare proportion of an image and is overly sensitive to spatial position. Inspired by these issues, we propose the ProMask, which is a novel skeleton detection model. The ProMask includes the probability mask and vector router. The skeleton probability mask representation explicitly encodes skeletons with segmentation signals, which can provide more supervised information to learn and pay more attention to ground-truth skeleton pixels. Moreover, the vector router module possesses two sets of orthogonal basis vectors in a two-dimensional space, which can dynamically adjust the predicted skeleton position. We evaluate our method on the well-known skeleton datasets, realizing the better performance than state-of-the-art approaches. Especially, ProMask significantly outperforms the competitive DeepFlux by 6.2% on the challenging SYM-PASCAL dataset. We consider that our proposed skeleton probability mask could serve as a solid baseline for future skeleton detection, since it is very effective and it requires about 10 lines of code.

84.Attention-Driven Dynamic Graph Convolutional Network for Multi-Label Image Recognition ⬇️

Recent studies often exploit Graph Convolutional Network (GCN) to model label dependencies to improve recognition accuracy for multi-label image recognition. However, constructing a graph by counting the label co-occurrence possibilities of the training data may degrade model generalizability, especially when there exist occasional co-occurrence objects in test images. Our goal is to eliminate such bias and enhance the robustness of the learnt features. To this end, we propose an Attention-Driven Dynamic Graph Convolutional Network (ADD-GCN) to dynamically generate a specific graph for each image. ADD-GCN adopts a Dynamic Graph Convolutional Network (D-GCN) to model the relation of content-aware category representations that are generated by a Semantic Attention Module (SAM). Extensive experiments on public multi-label benchmarks demonstrate the effectiveness of our method, which achieves mAPs of 85.2%, 96.0%, and 95.5% on MS-COCO, VOC2007, and VOC2012, respectively, and outperforms current state-of-the-art methods with a clear margin. All codes can be found at this https URL.

85.Spatially-Adaptive Pixelwise Networks for Fast Image Translation ⬇️

We introduce a new generator architecture, aimed at fast and efficient high-resolution image-to-image translation. We design the generator to be an extremely lightweight function of the full-resolution image. In fact, we use pixel-wise networks; that is, each pixel is processed independently of others, through a composition of simple affine transformations and nonlinearities. We take three important steps to equip such a seemingly simple function with adequate expressivity. First, the parameters of the pixel-wise networks are spatially varying so they can represent a broader function class than simple 1x1 convolutions. Second, these parameters are predicted by a fast convolutional network that processes an aggressively low-resolution representation of the input; Third, we augment the input image with a sinusoidal encoding of spatial coordinates, which provides an effective inductive bias for generating realistic novel high-frequency image content. As a result, our model is up to 18x faster than state-of-the-art baselines. We achieve this speedup while generating comparable visual quality across different image resolutions and translation domains.

86.Multi Scale Temporal Graph Networks For Skeleton-based Action Recognition ⬇️

Graph convolutional networks (GCNs) can effectively capture the features of related nodes and improve the performance of the model. More attention is paid to employing GCN in Skeleton-Based action recognition. But existing methods based on GCNs have two problems. First, the consistency of temporal and spatial features is ignored for extracting features node by node and frame by frame. To obtain spatiotemporal features simultaneously, we design a generic representation of skeleton sequences for action recognition and propose a novel model called Temporal Graph Networks (TGN). Secondly, the adjacency matrix of the graph describing the relation of joints is mostly dependent on the physical connection between joints. To appropriately describe the relations between joints in the skeleton graph, we propose a multi-scale graph strategy, adopting a full-scale graph, part-scale graph, and core-scale graph to capture the local features of each joint and the contour features of important joints. Experiments were carried out on two large datasets and results show that TGN with our graph strategy outperforms state-of-the-art methods.

87.FloodNet: A High Resolution Aerial Imagery Dataset for Post Flood Scene Understanding ⬇️

Visual scene understanding is the core task in making any crucial decision in any computer vision system. Although popular computer vision datasets like Cityscapes, MS-COCO, PASCAL provide good benchmarks for several tasks (e.g. image classification, segmentation, object detection), these datasets are hardly suitable for post disaster damage assessments. On the other hand, existing natural disaster datasets include mainly satellite imagery which have low spatial resolution and a high revisit period. Therefore, they do not have a scope to provide quick and efficient damage assessment tasks. Unmanned Aerial Vehicle(UAV) can effortlessly access difficult places during any disaster and collect high resolution imagery that is required for aforementioned tasks of computer vision. To address these issues we present a high resolution UAV imagery, FloodNet, captured after the hurricane Harvey. This dataset demonstrates the post flooded damages of the affected areas. The images are labeled pixel-wise for semantic segmentation task and questions are produced for the task of visual question answering. FloodNet poses several challenges including detection of flooded roads and buildings and distinguishing between natural water and flooded water. With the advancement of deep learning algorithms, we can analyze the impact of any disaster which can make a precise understanding of the affected areas. In this paper, we compare and contrast the performances of baseline methods for image classification, semantic segmentation, and visual question answering on our dataset.

88.Cirrus: A Long-range Bi-pattern LiDAR Dataset ⬇️

In this paper, we introduce Cirrus, a new long-range bi-pattern LiDAR public dataset for autonomous driving tasks such as 3D object detection, critical to highway driving and timely decision making. Our platform is equipped with a high-resolution video camera and a pair of LiDAR sensors with a 250-meter effective range, which is significantly longer than existing public datasets. We record paired point clouds simultaneously using both Gaussian and uniform scanning patterns. Point density varies significantly across such a long range, and different scanning patterns further diversify object representation in LiDAR. In Cirrus, eight categories of objects are exhaustively annotated in the LiDAR point clouds for the entire effective range. To illustrate the kind of studies supported by this new dataset, we introduce LiDAR model adaptation across different ranges, scanning patterns, and sensor devices. Promising results show the great potential of this new dataset to the robotics and computer vision communities.

89.Multi-head Knowledge Distillation for Model Compression ⬇️

Several methods of knowledge distillation have been developed for neural network compression. While they all use the KL divergence loss to align the soft outputs of the student model more closely with that of the teacher, the various methods differ in how the intermediate features of the student are encouraged to match those of the teacher. In this paper, we propose a simple-to-implement method using auxiliary classifiers at intermediate layers for matching features, which we refer to as multi-head knowledge distillation (MHKD). We add loss terms for training the student that measure the dissimilarity between student and teacher outputs of the auxiliary classifiers. At the same time, the proposed method also provides a natural way to measure differences at the intermediate layers even though the dimensions of the internal teacher and student features may be different. Through several experiments in image classification on multiple datasets we show that the proposed method outperforms prior relevant approaches presented in the literature.

90.Cosine-Pruned Medial Axis: A new method for isometric equivariant and noise-free medial axis extraction ⬇️

We present the CPMA, a new method for medial axis pruning with noise robustness and equivariance to isometric transformations. Our method leverages the discrete cosine transform to create smooth versions of a shape $\Omega$. We use the smooth shapes to compute a score function $\scorefunction$ that filters out spurious branches from the medial axis. We extensively compare the CPMA with state-of-the-art pruning methods and highlight our method's noise robustness and isometric equivariance. We found that our pruning approach achieves competitive results and yields stable medial axes even in scenarios with significant contour perturbations.

91.Knowledge Distillation Thrives on Data Augmentation ⬇️

Knowledge distillation (KD) is a general deep neural network training framework that uses a teacher model to guide a student model. Many works have explored the rationale for its success, however, its interplay with data augmentation (DA) has not been well recognized so far. In this paper, we are motivated by an interesting observation in classification: KD loss can benefit from extended training iterations while the cross-entropy loss does not. We show this disparity arises because of data augmentation: KD loss can tap into the extra information from different input views brought by DA. By this explanation, we propose to enhance KD via a stronger data augmentation scheme (e.g., mixup, CutMix). Furthermore, an even stronger new DA approach is developed specifically for KD based on the idea of active learning. The findings and merits of the proposed method are validated by extensive experiments on CIFAR-100, Tiny ImageNet, and ImageNet datasets. We can achieve improved performance simply by using the original KD loss combined with stronger augmentation schemes, compared to existing state-of-the-art methods, which employ more advanced distillation losses. In addition, when our approaches are combined with more advanced distillation losses, we can advance the state-of-the-art performance even more. On top of the encouraging performance, this paper also sheds some light on explaining the success of knowledge distillation. The discovered interplay between KD and DA may inspire more advanced KD algorithms.

92.Driver Glance Classification In-the-wild: Towards Generalization Across Domains and Subjects ⬇️

Distracted drivers are dangerous drivers. Equipping advanced driver assistance systems (ADAS) with the ability to detect driver distraction can help prevent accidents and improve driver safety. In order to detect driver distraction, an ADAS must be able to monitor their visual attention. We propose a model that takes as input a patch of the driver's face along with a crop of the eye-region and classifies their glance into 6 coarse regions-of-interest (ROIs) in the vehicle. We demonstrate that an hourglass network, trained with an additional reconstruction loss, allows the model to learn stronger contextual feature representations than a traditional encoder-only classification module. To make the system robust to subject-specific variations in appearance and behavior, we design a personalized hourglass model tuned with an auxiliary input representing the driver's baseline glance behavior. Finally, we present a weakly supervised multi-domain training regimen that enables the hourglass to jointly learn representations from different domains (varying in camera type, angle), utilizing unlabeled samples and thereby reducing annotation cost.

93.Automated Calibration of Mobile Cameras for 3D Reconstruction of Mechanical Pipes ⬇️

This manuscript provides a new framework for calibration of optical instruments, in particular mobile cameras, using large-scale circular black and white target fields. New methods were introduced for (i) matching targets between images; (ii) adjusting the systematic eccentricity error of target centers; and (iii) iteratively improving the calibration solution through a free-network self-calibrating bundle adjustment. It was observed that the proposed target matching effectively matched circular targets in 270 mobile phone images from a complete calibration laboratory with robustness to Type II errors. The proposed eccentricity adjustment, which requires only camera projective matrices from two views, behaved synonymous to available closed-form solutions, which require several additional object space target information a priori. Finally, specifically for the case of the mobile devices, the calibration parameters obtained using our framework was found superior compared to in-situ calibration for estimating the 3D reconstructed radius of a mechanical pipe (approximately 45% improvement).

94.Discovering Underground Maps from Fashion ⬇️

The fashion sense -- meaning the clothing styles people wear -- in a geographical region can reveal information about that region. For example, it can reflect the kind of activities people do there, or the type of crowds that frequently visit the region (e.g., tourist hot spot, student neighborhood, business center). We propose a method to automatically create underground neighborhood maps of cities by analyzing how people dress. Using publicly available images from across a city, our method finds neighborhoods with a similar fashion sense and segments the map without supervision. For 37 cities worldwide, we show promising results in creating good underground maps, as evaluated using experiments with human judges and underground map benchmarks derived from non-image data. Our approach further allows detecting distinct neighborhoods (what is the most unique region of LA?) and answering analogy questions between cities (what is the "Downtown LA" of Bogota?).

95.MPG: A Multi-ingredient Pizza Image Generator with Conditional StyleGANs ⬇️

Multilabel conditional image generation is a challenging problem in computer vision. In this work we propose Multi-ingredient Pizza Generator (MPG), a conditional Generative Neural Network (GAN) framework for synthesizing multilabel images. We design MPG based on a state-of-the-art GAN structure called StyleGAN2, in which we develop a new conditioning technique by enforcing intermediate feature maps to learn scalewise label information. Because of the complex nature of the multilabel image generation problem, we also regularize synthetic image by predicting the corresponding ingredients as well as encourage the discriminator to distinguish between matched image and mismatched image. To verify the efficacy of MPG, we test it on Pizza10, which is a carefully annotated multi-ingredient pizza image dataset. MPG can successfully generate photo-realist pizza images with desired ingredients. The framework can be easily extend to other multilabel image generation scenarios.

96.Encoding the latent posterior of Bayesian Neural Networks for uncertainty quantification ⬇️

Bayesian neural networks (BNNs) have been long considered an ideal, yet unscalable solution for improving the robustness and the predictive uncertainty of deep neural networks. While they could capture more accurately the posterior distribution of the network parameters, most BNN approaches are either limited to small networks or rely on constraining assumptions such as parameter independence. These drawbacks have enabled prominence of simple, but computationally heavy approaches such as Deep Ensembles, whose training and testing costs increase linearly with the number of networks. In this work we aim for efficient deep BNNs amenable to complex computer vision architectures, e.g. ResNet50 DeepLabV3+, and tasks, e.g. semantic segmentation, with fewer assumptions on the parameters. We achieve this by leveraging variational autoencoders (VAEs) to learn the interaction and the latent distribution of the parameters at each network layer. Our approach, Latent-Posterior BNN (LP-BNN), is compatible with the recent BatchEnsemble method, leading to highly efficient ({in terms of computation and} memory during both training and testing) ensembles. LP-BNN s attain competitive results across multiple metrics in several challenging benchmarks for image classification, semantic segmentation and out-of-distribution detection.

97.The Role of Regularization in Shaping Weight and Node Pruning Dependency and Dynamics ⬇️

The pressing need to reduce the capacity of deep neural networks has stimulated the development of network dilution methods and their analysis. While the ability of $L_1$ and $L_0$ regularization to encourage sparsity is often mentioned, $L_2$ regularization is seldom discussed in this context. We present a novel framework for weight pruning by sampling from a probability function that favors the zeroing of smaller weights. In addition, we examine the contribution of $L_1$ and $L_2$ regularization to the dynamics of node pruning while optimizing for weight pruning. We then demonstrate the effectiveness of the proposed stochastic framework when used together with a weight decay regularizer on popular classification models in removing 50% of the nodes in an MLP for MNIST classification, 60% of the filters in VGG-16 for CIFAR10 classification, and on medical image models in removing 60% of the channels in a U-Net for instance segmentation and 50% of the channels in CNN model for COVID-19 detection. For these node-pruned networks, we also present competitive weight pruning results that are only slightly less accurate than the original, dense networks.

98.Perspectives on Sim2Real Transfer for Robotics: A Summary of the R:SS 2020 Workshop ⬇️

This report presents the debates, posters, and discussions of the Sim2Real workshop held in conjunction with the 2020 edition of the "Robotics: Science and System" conference. Twelve leaders of the field took competing debate positions on the definition, viability, and importance of transferring skills from simulation to the real world in the context of robotics problems. The debaters also joined a large panel discussion, answering audience questions and outlining the future of Sim2Real in robotics. Furthermore, we invited extended abstracts to this workshop which are summarized in this report. Based on the workshop, this report concludes with directions for practitioners exploiting this technology and for researchers further exploring open problems in this area.

99.Triplet Entropy Loss: Improving The Generalisation of Short Speech Language Identification Systems ⬇️

We present several methods to improve the generalisation of language identification (LID) systems to new speakers and to new domains. These methods involve Spectral augmentation, where spectrograms are masked in the frequency or time bands during training and CNN architectures that are pre-trained on the Imagenet dataset. The paper also introduces the novel Triplet Entropy Loss training method, which involves training a network simultaneously using Cross Entropy and Triplet loss. It was found that all three methods improved the generalisation of the models, though not significantly. Even though the models trained using Triplet Entropy Loss showed a better understanding of the languages and higher accuracies, it appears as though the models still memorise word patterns present in the spectrograms rather than learning the finer nuances of a language. The research shows that Triplet Entropy Loss has great potential and should be investigated further, not only in language identification tasks but any classification task.

100.Overcoming Barriers to Data Sharing with Medical Image Generation: A Comprehensive Evaluation ⬇️

Privacy concerns around sharing personally identifiable information are a major practical barrier to data sharing in medical research. However, in many cases, researchers have no interest in a particular individual's information but rather aim to derive insights at the level of cohorts. Here, we utilize Generative Adversarial Networks (GANs) to create derived medical imaging datasets consisting entirely of synthetic patient data. The synthetic images ideally have, in aggregate, similar statistical properties to those of a source dataset but do not contain sensitive personal information. We assess the quality of synthetic data generated by two GAN models for chest radiographs with 14 different radiology findings and brain computed tomography (CT) scans with six types of intracranial hemorrhages. We measure the synthetic image quality by the performance difference of predictive models trained on either the synthetic or the real dataset. We find that synthetic data performance disproportionately benefits from a reduced number of unique label combinations and determine at what number of samples per class overfitting effects start to dominate GAN training. Our open-source benchmark findings also indicate that synthetic data generation can benefit from higher levels of spatial resolution. We additionally conducted a reader study in which trained radiologists do not perform better than random on discriminating between synthetic and real medical images for both data modalities to a statistically significant extent. Our study offers valuable guidelines and outlines practical conditions under which insights derived from synthetic medical images are similar to those that would have been derived from real imaging data. Our results indicate that synthetic data sharing may be an attractive and privacy-preserving alternative to sharing real patient-level data in the right settings.

101.Learning Tactile Models for Factor Graph-based State Estimation ⬇️

We address the problem of estimating object pose from touch during manipulation under occlusion. Vision-based tactile sensors provide rich, local measurements at the point of contact. A single such measurement, however, contains limited information and multiple measurements are needed to infer latent object state. We solve this inference problem using a factor graph. In order to incorporate tactile measurements in the graph, we need local observation models that can map high-dimensional tactile images onto a low-dimensional state space. Prior work has used low-dimensional force measurements or hand-designed functions to interpret tactile measurements. These methods, however, can be brittle and difficult to scale across objects and sensors. Our key insight is to directly learn tactile observation models that predict the relative pose of the sensor given a pair of tactile images. These relative poses can then be incorporated as factors within a factor graph. We propose a two-stage approach: first we learn local tactile observation models supervised with ground truth data, and then integrate these models along with physics and geometric factors within a factor graph optimizer. We demonstrate reliable object tracking using only tactile feedback for over 150 real-world planar pushing sequences with varying trajectories across three object shapes. Supplementary video: this https URL

102.Multi-Decoder Networks with Multi-Denoising Inputs for Tumor Segmentation ⬇️

Automatic segmentation of brain glioma from multimodal MRI scans plays a key role in clinical trials and practice. Unfortunately, manual segmentation is very challenging, time-consuming, costly, and often inaccurate despite human expertise due to the high variance and high uncertainty in the human annotations. In the present work, we develop an end-to-end deep-learning-based segmentation method using a multi-decoder architecture by jointly learning three separate sub-problems using a partly shared encoder. We also propose to apply smoothing methods to the input images to generate denoised versions as additional inputs to the network. The validation performance indicate an improvement when using the proposed method. The proposed method was ranked 2nd in the task of Quantification of Uncertainty in Segmentation in the Brain Tumors in Multimodal Magnetic Resonance Imaging Challenge 2020.

103.Learning normal appearance for fetal anomaly screening: Application to the unsupervised detection of Hypoplastic Left Heart Syndrome ⬇️

Congenital heart disease is considered as one the most common groups of congenital malformations which affects $6-11$ per $1000$ newborns. In this work, an automated framework for detection of cardiac anomalies during ultrasound screening is proposed and evaluated on the example of Hypoplastic Left Heart Syndrome (HLHS), a sub-category of congenital heart disease. We propose an unsupervised approach that learns healthy anatomy exclusively from clinically confirmed normal control patients. We evaluate a number of known anomaly detection frameworks together with a new model architecture based on the $\alpha$-GAN network and find evidence that the proposed model performs significantly better than the state-of-the-art in image-based anomaly detection, yielding average $0.81$ AUC \emph{and} a better robustness towards initialisation compared to previous works.

104.Binary Segmentation of Seismic Facies Using Encoder-Decoder Neural Networks ⬇️

The interpretation of seismic data is vital for characterizing sediments' shape in areas of geological study. In seismic interpretation, deep learning becomes useful for reducing the dependence on handcrafted facies segmentation geometry and the time required to study geological areas. This work presents a Deep Neural Network for Facies Segmentation (DNFS) to obtain state-of-the-art results for seismic facies segmentation. DNFS is trained using a combination of cross-entropy and Jaccard loss functions. Our results show that DNFS obtains highly detailed predictions for seismic facies segmentation using fewer parameters than StNet and U-Net.

105.Efficient Medical Image Segmentation with Intermediate Supervision Mechanism ⬇️

Because the expansion path of U-Net may ignore the characteristics of small targets, intermediate supervision mechanism is proposed. The original mask is also entered into the network as a label for intermediate output. However, U-Net is mainly engaged in segmentation, and the extracted features are also targeted at segmentation location information, and the input and output are different. The label we need is that the input and output are both original masks, which is more similar to the refactoring process, so we propose another intermediate supervision mechanism. However, the features extracted by the contraction path of this intermediate monitoring mechanism are not necessarily consistent. For example, U-Net's contraction path extracts transverse features, while auto-encoder extracts longitudinal features, which may cause the output of the expansion path to be inconsistent with the label. Therefore, we put forward the intermediate supervision mechanism of shared-weight decoder module. Although the intermediate supervision mechanism improves the segmentation accuracy, the training time is too long due to the extra input and multiple loss functions. For one of these problems, we have introduced tied-weight decoder. To reduce the redundancy of the model, we combine shared-weight decoder module with tied-weight decoder module.

106.Impact of Power Supply Noise on Image Sensor Performance in Automotive Applications ⬇️

Vision Systems are quickly becoming a large component of Active Automotive Safety Systems. In order to be effective in critical safety applications these systems must produce high quality images in both daytime and night-time scenarios in order to provide the large informational content required for software analysis in applications such as lane departure, pedestrian detection and collision detection. The challenge in capturing high quality images in low light scenarios is that the signal to noise ratio is greatly reduced, which can result in noise becoming the dominant factor in a captured image, thereby making these safety systems less effective at night. Research has been undertaken to develop a systematic method of characterising image sensor performance in response to electrical noise in order to improve the design and performance of automotive cameras in low light scenarios. The root cause of image row noise has been established and a mathematical algorithm for determining the magnitude of row noise in an image has been devised. An automated characterisation method has been developed to allow performance characterisation in response to a large frequency spectrum of electrical noise on the image sensor power supply. Various strategies of improving image sensor performance for low light applications have also been proposed from the research outcomes.

107.Deep Metric Learning-based Image Retrieval System for Chest Radiograph and its Clinical Applications in COVID-19 ⬇️

In recent years, deep learning-based image analysis methods have been widely applied in computer-aided detection, diagnosis and prognosis, and has shown its value during the public health crisis of the novel coronavirus disease 2019 (COVID-19) pandemic. Chest radiograph (CXR) has been playing a crucial role in COVID-19 patient triaging, diagnosing and monitoring, particularly in the United States. Considering the mixed and unspecific signals in CXR, an image retrieval model of CXR that provides both similar images and associated clinical information can be more clinically meaningful than a direct image diagnostic model. In this work we develop a novel CXR image retrieval model based on deep metric learning. Unlike traditional diagnostic models which aims at learning the direct mapping from images to labels, the proposed model aims at learning the optimized embedding space of images, where images with the same labels and similar contents are pulled together. It utilizes multi-similarity loss with hard-mining sampling strategy and attention mechanism to learn the optimized embedding space, and provides similar images to the query image. The model is trained and validated on an international multi-site COVID-19 dataset collected from 3 different sources. Experimental results of COVID-19 image retrieval and diagnosis tasks show that the proposed model can serve as a robust solution for CXR analysis and patient management for COVID-19. The model is also tested on its transferability on a different clinical decision support task, where the pre-trained model is applied to extract image features from a new dataset without any further training. These results demonstrate our deep metric learning based image retrieval model is highly efficient in the CXR retrieval, diagnosis and prognosis, and thus has great clinical value for the treatment and management of COVID-19 patients.

108.Noise2Kernel: Adaptive Self-Supervised Blind Denoising using a Dilated Convolutional Kernel Architecture ⬇️

With the advent of recent advances in unsupervised learning, efficient training of a deep network for image denoising without pairs of noisy and clean images has become feasible. However, most current unsupervised denoising methods are built on the assumption of zero-mean noise under the signal-independent condition. This assumption causes blind denoising techniques to suffer brightness shifting problems on images that are greatly corrupted by extreme noise such as salt-and-pepper noise. Moreover, most blind denoising methods require a random masking scheme for training to ensure the invariance of the denoising process. In this paper, we propose a dilated convolutional network that satisfies an invariant property, allowing efficient kernel-based training without random masking. We also propose an adaptive self-supervision loss to circumvent the requirement of zero-mean constraint, which is specifically effective in removing salt-and-pepper or hybrid noise where a prior knowledge of noise statistics is not readily available. We demonstrate the efficacy of the proposed method by comparing it with state-of-the-art denoising methods using various examples.

109.Efficient Kernel based Matched Filter Approach for Segmentation of Retinal Blood Vessels ⬇️

Retinal blood vessels structure contains information about diseases like obesity, diabetes, hypertension and glaucoma. This information is very useful in identification and treatment of these fatal diseases. To obtain this information, there is need to segment these retinal vessels. Many kernel based methods have been given for segmentation of retinal vessels but their kernels are not appropriate to vessel profile cause poor performance. To overcome this, a new and efficient kernel based matched filter approach has been proposed. The new matched filter is used to generate the matched filter response (MFR) image. We have applied Otsu thresholding method on obtained MFR image to extract the vessels. We have conducted extensive experiments to choose best value of parameters for the proposed matched filter kernel. The proposed approach has examined and validated on two online available DRIVE and STARE datasets. The proposed approach has specificity 98.50%, 98.23% and accuracy 95.77 %, 95.13% for DRIVE and STARE dataset respectively. Obtained results confirm that the proposed method has better performance than others. The reason behind increased performance is due to appropriate proposed kernel which matches retinal blood vessel profile more accurately.

110.Self-Supervision Closes the Gap Between Weak and Strong Supervision in Histology ⬇️

One of the biggest challenges for applying machine learning to histopathology is weak supervision: whole-slide images have billions of pixels yet often only one global label. The state of the art therefore relies on strongly-supervised model training using additional local annotations from domain experts. However, in the absence of detailed annotations, most weakly-supervised approaches depend on a frozen feature extractor pre-trained on ImageNet. We identify this as a key weakness and propose to train an in-domain feature extractor on histology images using MoCo v2, a recent self-supervised learning algorithm. Experimental results on Camelyon16 and TCGA show that the proposed extractor greatly outperforms its ImageNet counterpart. In particular, our results improve the weakly-supervised state of the art on Camelyon16 from 91.4% to 98.7% AUC, thereby closing the gap with strongly-supervised models that reach 99.3% AUC. Through these experiments, we demonstrate that feature extractors trained via self-supervised learning can act as drop-in replacements to significantly improve existing machine learning techniques in histology. Lastly, we show that the learned embedding space exhibits biologically meaningful separation of tissue structures.

111.DIPPAS: A Deep Image Prior PRNU Anonymization Scheme ⬇️

Source device identification is an important topic in image forensics since it allows to trace back the origin of an image. Its forensics counter-part is source device anonymization, that is, to mask any trace on the image that can be useful for identifying the source device. A typical trace exploited for source device identification is the Photo Response Non-Uniformity (PRNU), a noise pattern left by the device on the acquired images. In this paper, we devise a methodology for suppressing such a trace from natural images without significant impact on image quality. Specifically, we turn PRNU anonymization into an optimization problem in a Deep Image Prior (DIP) framework. In a nutshell, a Convolutional Neural Network (CNN) acts as generator and returns an image that is anonymized with respect to the source PRNU, still maintaining high visual quality. With respect to widely-adopted deep learning paradigms, our proposed CNN is not trained on a set of input-target pairs of images. Instead, it is optimized to reconstruct the PRNU-free image from the original image under analysis itself. This makes the approach particularly suitable in scenarios where large heterogeneous databases are analyzed and prevents any problem due to lack of generalization. Through numerical examples on publicly available datasets, we prove our methodology to be effective compared to state-of-the-art techniques.

112.Robustness Investigation on Deep Learning CT Reconstruction for Real-Time Dose Optimization ⬇️

In computed tomography (CT), automatic exposure control (AEC) is frequently used to reduce radiation dose exposure to patients. For organ-specific AEC, a preliminary CT reconstruction is necessary to estimate organ shapes for dose optimization, where only a few projections are allowed for real-time reconstruction. In this work, we investigate the performance of automated transform by manifold approximation (AUTOMAP) in such applications. For proof of concept, we investigate its performance on the MNIST dataset first, where the dataset containing all the 10 digits are randomly split into a training set and a test set. We train the AUTOMAP model for image reconstruction from 2 projections or 4 projections directly. The test results demonstrate that AUTOMAP is able to reconstruct most digits well with a false rate of 1.6% and 6.8% respectively. In our subsequent experiment, the MNIST dataset is split in a way that the training set contains 9 digits only while the test set contains the excluded digit only, for instance "2". In the test results, the digit "2"s are falsely predicted as "3" or "5" when using 2 projections for reconstruction, reaching a false rate of 94.4%. For the application in medical images, AUTOMAP is also trained on patients' CT images. The test images reach an average root-mean-square error of 290 HU. Although the coarse body outlines are well reconstructed, some organs are misshaped.

113.Backpropagating Linearly Improves Transferability of Adversarial Examples ⬇️

The vulnerability of deep neural networks (DNNs) to adversarial examples has drawn great attention from the community. In this paper, we study the transferability of such examples, which lays the foundation of many black-box attacks on DNNs. We revisit a not so new but definitely noteworthy hypothesis of Goodfellow et al.'s and disclose that the transferability can be enhanced by improving the linearity of DNNs in an appropriate manner. We introduce linear backpropagation (LinBP), a method that performs backpropagation in a more linear fashion using off-the-shelf attacks that exploit gradients. More specifically, it calculates forward as normal but backpropagates loss as if some nonlinear activations are not encountered in the forward pass. Experimental results demonstrate that this simple yet effective method obviously outperforms current state-of-the-arts in crafting transferable adversarial examples on CIFAR-10 and ImageNet, leading to more effective attacks on a variety of DNNs.

114.An Approach to Intelligent Pneumonia Detection and Integration ⬇️

Each year, over 2.5 million people, most of them in developed countries, die from pneumonia [1]. Since many studies have proved pneumonia is successfully treatable when timely and correctly diagnosed, many of diagnosis aids have been developed, with AI-based methods achieving high accuracies [2]. However, currently, the usage of AI in pneumonia detection is limited, in particular, due to challenges in generalizing a locally achieved result. In this report, we propose a roadmap for creating and integrating a system that attempts to solve this challenge. We also address various technical, legal, ethical, and logistical issues, with a blueprint of possible solutions.

115.Multi-Instrumentalist Net: Unsupervised Generation of Music from Body Movements ⬇️

We propose a novel system that takes as an input body movements of a musician playing a musical instrument and generates music in an unsupervised setting. Learning to generate multi-instrumental music from videos without labeling the instruments is a challenging problem. To achieve the transformation, we built a pipeline named 'Multi-instrumentalistNet' (MI Net). At its base, the pipeline learns a discrete latent representation of various instruments music from log-spectrogram using a Vector Quantized Variational Autoencoder (VQ-VAE) with multi-band residual blocks. The pipeline is then trained along with an autoregressive prior conditioned on the musician's body keypoints movements encoded by a recurrent neural network. Joint training of the prior with the body movements encoder succeeds in the disentanglement of the music into latent features indicating the musical components and the instrumental features. The latent space results in distributions that are clustered into distinct instruments from which new music can be generated. Furthermore, the VQ-VAE architecture supports detailed music generation with additional conditioning. We show that a Midi can further condition the latent space such that the pipeline will generate the exact content of the music being played by the instrument in the video. We evaluate MI Net on two datasets containing videos of 13 instruments and obtain generated music of reasonable audio quality, easily associated with the corresponding instrument, and consistent with the music audio content.

116.Multivariate Density Estimation with Deep Neural Mixture Models ⬇️

Albeit worryingly underrated in the recent literature on machine learning in general (and, on deep learning in particular), multivariate density estimation is a fundamental task in many applications, at least implicitly, and still an open issue. With a few exceptions, deep neural networks (DNNs) have seldom been applied to density estimation, mostly due to the unsupervised nature of the estimation task, and (especially) due to the need for constrained training algorithms that ended up realizing proper probabilistic models that satisfy Kolmogorov's axioms. Moreover, in spite of the well-known improvement in terms of modeling capabilities yielded by mixture models over plain single-density statistical estimators, no proper mixtures of multivariate DNN-based component densities have been investigated so far. The paper fills this gap by extending our previous work on Neural Mixture Densities (NMMs) to multivariate DNN mixtures. A maximum-likelihood (ML) algorithm for estimating Deep NMMs (DNMMs) is handed out, which satisfies numerically a combination of hard and soft constraints aimed at ensuring satisfaction of Kolmogorov's axioms. The class of probability density functions that can be modeled to any degree of precision via DNMMs is formally defined. A procedure for the automatic selection of the DNMM architecture, as well as of the hyperparameters for its ML training algorithm, is presented (exploiting the probabilistic nature of the DNMM). Experimental results on univariate and multivariate data are reported on, corroborating the effectiveness of the approach and its superiority to the most popular statistical estimation techniques.

117.Spatio-Temporal Graph Scattering Transform ⬇️

Although spatio-temporal graph neural networks have achieved great empirical success in handling multiple correlated time series, they may be impractical in some real-world scenarios due to a lack of sufficient high-quality training data. Furthermore, spatio-temporal graph neural networks lack theoretical interpretation. To address these issues, we put forth a novel mathematically designed framework to analyze spatio-temporal data. Our proposed spatio-temporal graph scattering transform (ST-GST) extends traditional scattering transforms to the spatio-temporal domain. It performs iterative applications of spatio-temporal graph wavelets and nonlinear activation functions, which can be viewed as a forward pass of spatio-temporal graph convolutional networks without training. Since all the filter coefficients in ST-GST are mathematically designed, it is promising for the real-world scenarios with limited training data, and also allows for a theoretical analysis, which shows that the proposed ST-GST is stable to small perturbations of input signals and structures. Finally, our experiments show that i) ST-GST outperforms spatio-temporal graph convolutional networks by an increase of 35% in accuracy for MSR Action3D dataset; ii) it is better and computationally more efficient to design the transform based on separable spatio-temporal graphs than the joint ones; and iii) the nonlinearity in ST-GST is critical to empirical performance.

118.An Uncertainty-Driven GCN Refinement Strategy for Organ Segmentation ⬇️

Organ segmentation in CT volumes is an important pre-processing step in many computer assisted intervention and diagnosis methods. In recent years, convolutional neural networks have dominated the state of the art in this task. However, since this problem presents a challenging environment due to high variability in the organ's shape and similarity between tissues, the generation of false negative and false positive regions in the output segmentation is a common issue. Recent works have shown that the uncertainty analysis of the model can provide us with useful information about potential errors in the segmentation. In this context, we proposed a segmentation refinement method based on uncertainty analysis and graph convolutional networks. We employ the uncertainty levels of the convolutional network in a particular input volume to formulate a semi-supervised graph learning problem that is solved by training a graph convolutional network. To test our method we refine the initial output of a 2D U-Net. We validate our framework with the NIH pancreas dataset and the spleen dataset of the medical segmentation decathlon. We show that our method outperforms the state-of-the-art CRF refinement method by improving the dice score by 1% for the pancreas and 2% for spleen, with respect to the original U-Net's prediction. Finally, we perform a sensitivity analysis on the parameters of our proposal and discuss the applicability to other CNN architectures, the results, and current limitations of the model for future work in this research direction. For reproducibility purposes, we make our code publicly available at this https URL.

119.Global Unifying Intrinsic Calibration for Spinning and Solid-State LiDARs ⬇️

Sensor calibration, which can be intrinsic or extrinsic, is an essential step to achieve the measurement accuracy required for modern perception and navigation systems deployed on autonomous robots. To date, intrinsic calibration models for spinning LiDARs have been based on hypothesized based on their physical mechanisms, resulting in anywhere from three to ten parameters to be estimated from data, while no phenomenological models have yet been proposed for solid-state LiDARs. Instead of going down that road, we propose to abstract away from the physics of a LiDAR type (spinning vs solid-state, for example), and focus on the spatial geometry of the point cloud generated by the sensor. By modeling the calibration parameters as an element of a special matrix Lie Group, we achieve a unifying view of calibration for different types of LiDARs. We further prove mathematically that the proposed model is well-constrained (has a unique answer) given four appropriately orientated targets. The proof provides a guideline for target positioning in the form of a tetrahedron. Moreover, an existing Semidefinite programming global solver for SE(3) can be modified to compute efficiently the optimal calibration parameters. For solid state LiDARs, we illustrate how the method works in simulation. For spinning LiDARs, we show with experimental data that the proposed matrix Lie Group model performs equally well as physics-based models in terms of reducing the P2P distance, while being more robust to noise.

120.CoEdge: Cooperative DNN Inference with Adaptive Workload Partitioning over Heterogeneous Edge Devices ⬇️

Recent advances in artificial intelligence have driven increasing intelligent applications at the network edge, such as smart home, smart factory, and smart city. To deploy computationally intensive Deep Neural Networks (DNNs) on resource-constrained edge devices, traditional approaches have relied on either offloading workload to the remote cloud or optimizing computation at the end device locally. However, the cloud-assisted approaches suffer from the unreliable and delay-significant wide-area network, and the local computing approaches are limited by the constrained computing capability. Towards high-performance edge intelligence, the cooperative execution mechanism offers a new paradigm, which has attracted growing research interest recently. In this paper, we propose CoEdge, a distributed DNN computing system that orchestrates cooperative DNN inference over heterogeneous edge devices. CoEdge utilizes available computation and communication resources at the edge and dynamically partitions the DNN inference workload adaptive to devices' computing capabilities and network conditions. Experimental evaluations based on a realistic prototype show that CoEdge outperforms status-quo approaches in saving energy with close inference latency, achieving up to 25.5%~66.9% energy reduction for four widely-adopted CNN models.

121.Learning to Reduce Defocus Blur by Realistically Modeling Dual-Pixel Data ⬇️

Recent work has shown impressive results on data-driven defocus deblurring using the two-image views available on modern dual-pixel (DP) sensors. One significant challenge in this line of research is access to DP data. Despite many cameras having DP sensors, only a limited number provide access to the low-level DP sensor images. In addition, capturing training data for defocus deblurring involves a time-consuming and tedious setup requiring the camera's aperture to be adjusted. Some cameras with DP sensors (e.g., smartphones) do not have adjustable apertures, further limiting the ability to produce the necessary training data. We address the data capture bottleneck by proposing a procedure to generate realistic DP data synthetically. Our synthesis approach mimics the optical image formation found on DP sensors and can be applied to virtual scenes rendered with standard computer software. Leveraging these realistic synthetic DP images, we introduce a new recurrent convolutional network (RCN) architecture that can improve deblurring results and is suitable for use with single-frame and multi-frame data captured by DP sensors. Finally, we show that our synthetic DP data is useful for training DNN models targeting video deblurring applications where access to DP data remains challenging.

122.Esophageal Tumor Segmentation in CT Images using a 3D Convolutional Neural Network ⬇️

Manual or automatic delineation of the esophageal tumor in CT images is known to be very challenging. This is due to the low contrast between the tumor and adjacent tissues, the anatomical variation of the esophagus, as well as the occasional presence of foreign bodies (e.g. feeding tubes). Physicians therefore usually exploit additional knowledge such as endoscopic findings, clinical history, additional imaging modalities like PET scans. Achieving his additional information is time-consuming, while the results are error-prone and might lead to non-deterministic results. In this paper we aim to investigate if and to what extent a simplified clinical workflow based on CT alone, allows one to automatically segment the esophageal tumor with sufficient quality. For this purpose, we present a fully automatic end-to-end esophageal tumor segmentation method based on convolutional neural networks (CNNs). The proposed network, called Dilated Dense Attention Unet (DDAUnet), leverages spatial and channel attention gates in each dense block to selectively concentrate on determinant feature maps and regions. Dilated convolutional layers are used to manage GPU memory and increase the network receptive field. We collected a dataset of 792 scans from 288 distinct patients including varying anatomies with \mbox{air pockets}, feeding tubes and proximal tumors. Repeatability and reproducibility studies were conducted for three distinct splits of training and validation sets. The proposed network achieved a $\mathrm{DSC}$ value of $0.79 \pm 0.20$, a mean surface distance of $5.4 \pm 20.2mm$ and $95%$ Hausdorff distance of $14.7 \pm 25.0mm$ for 287 test scans, demonstrating promising results with a simplified clinical workflow based on CT alone. Our code is publicly available via \url{this https URL}.

123.Robust Deep AUC Maximization: A New Surrogate Loss and Empirical Studies on Medical Image Classification ⬇️

Deep AUC Maximization (DAM) is a paradigm for learning a deep neural network by maximizing the AUC score of the model on a dataset. Most previous works of AUC maximization focus on the perspective of optimization by designing efficient stochastic algorithms, and studies on generalization performance of DAM on difficult tasks are missing. In this work, we aim to make DAM more practical for interesting real-world applications (e.g., medical image classification). First, we propose a new margin-based surrogate loss function for the AUC score (named as the AUC margin loss). It is more robust than the commonly used AUC square loss, while enjoying the same advantage in terms of large-scale stochastic optimization. Second, we conduct empirical studies of our DAM method on difficult medical image classification tasks, namely classification of chest x-ray images for identifying many threatening diseases and classification of images of skin lesions for identifying melanoma. Our DAM method has achieved great success on these difficult tasks, i.e., the 1st place on Stanford CheXpert competition (by the paper submission date) and Top 1% rank (rank 33 out of 3314 teams) on Kaggle 2020 Melanoma classification competition. We also conduct extensive ablation studies to demonstrate the advantages of the new AUC margin loss over the AUC square loss on benchmark datasets. To the best of our knowledge, this is the first work that makes DAM succeed on large-scale medical image datasets.

124.Selective Eye-gaze Augmentation To Enhance Imitation Learning In Atari Games ⬇️

This paper presents the selective use of eye-gaze information in learning human actions in Atari games. Vast evidence suggests that our eye movement convey a wealth of information about the direction of our attention and mental states and encode the information necessary to complete a task. Based on this evidence, we hypothesize that selective use of eye-gaze, as a clue for attention direction, will enhance the learning from demonstration. For this purpose, we propose a selective eye-gaze augmentation (SEA) network that learns when to use the eye-gaze information. The proposed network architecture consists of three sub-networks: gaze prediction, gating, and action prediction network. Using the prior 4 game frames, a gaze map is predicted by the gaze prediction network which is used for augmenting the input frame. The gating network will determine whether the predicted gaze map should be used in learning and is fed to the final network to predict the action at the current frame. To validate this approach, we use publicly available Atari Human Eye-Tracking And Demonstration (Atari-HEAD) dataset consists of 20 Atari games with 28 million human demonstrations and 328 million eye-gazes (over game frames) collected from four subjects. We demonstrate the efficacy of selective eye-gaze augmentation in comparison with state of the art Attention Guided Imitation Learning (AGIL), Behavior Cloning (BC). The results indicate that the selective augmentation approach (the SEA network) performs significantly better than the AGIL and BC. Moreover, to demonstrate the significance of selective use of gaze through the gating network, we compare our approach with the random selection of the gaze. Even in this case, the SEA network performs significantly better validating the advantage of selectively using the gaze in demonstration learning.

125.Development and Characterization of a Chest CT Atlas ⬇️

A major goal of lung cancer screening is to identify individuals with particular phenotypes that are associated with high risk of cancer. Identifying relevant phenotypes is complicated by the variation in body position and body composition. In the brain, standardized coordinate systems (e.g., atlases) have enabled separate consideration of local features from gross/global structure. To date, no analogous standard atlas has been presented to enable spatial mapping and harmonization in chest computational tomography (CT). In this paper, we propose a thoracic atlas built upon a large low dose CT (LDCT) database of lung cancer screening program. The study cohort includes 466 male and 387 female subjects with no screening detected malignancy (age 46-79 years, mean 64.9 years). To provide spatial mapping, we optimize a multi-stage inter-subject non-rigid registration pipeline for the entire thoracic space. We evaluate the optimized pipeline relative to two baselines with alternative non-rigid registration module: the same software with default parameters and an alternative software. We achieve a significant improvement in terms of registration success rate based on manual QA. For the entire study cohort, the optimized pipeline achieves a registration success rate of 91.7%. The application validity of the developed atlas is evaluated in terms of discriminative capability for different anatomic phenotypes, including body mass index (BMI), chronic obstructive pulmonary disease (COPD), and coronary artery calcification (CAC).

126.When Do Curricula Work? ⬇️

Inspired by human learning, researchers have proposed ordering examples during training based on their difficulty. Both curriculum learning, exposing a network to easier examples early in training, and anti-curriculum learning, showing the most difficult examples first, have been suggested as improvements to the standard i.i.d. training. In this work, we set out to investigate the relative benefits of ordered learning. We first investigate the \emph{implicit curricula} resulting from architectural and optimization bias and find that samples are learned in a highly consistent order. Next, to quantify the benefit of \emph{explicit curricula}, we conduct extensive experiments over thousands of orderings spanning three kinds of learning: curriculum, anti-curriculum, and random-curriculum -- in which the size of the training dataset is dynamically increased over time, but the examples are randomly ordered. We find that for standard benchmark datasets, curricula have only marginal benefits, and that randomly ordered samples perform as well or better than curricula and anti-curricula, suggesting that any benefit is entirely due to the dynamic training set size. Inspired by common use cases of curriculum learning in practice, we investigate the role of limited training time budget and noisy data in the success of curriculum learning. Our experiments demonstrate that curriculum, but not anti-curriculum can indeed improve the performance either with limited training time budget or in existence of noisy data.

127.A Survey on Deep Learning with Noisy Labels: How to train your model when you cannot trust on the annotations? ⬇️

Noisy Labels are commonly present in data sets automatically collected from the internet, mislabeled by non-specialist annotators, or even specialists in a challenging task, such as in the medical field. Although deep learning models have shown significant improvements in different domains, an open issue is their ability to memorize noisy labels during training, reducing their generalization potential. As deep learning models depend on correctly labeled data sets and label correctness is difficult to guarantee, it is crucial to consider the presence of noisy labels for deep learning training. Several approaches have been proposed in the literature to improve the training of deep learning models in the presence of noisy labels. This paper presents a survey on the main techniques in literature, in which we classify the algorithm in the following groups: robust losses, sample weighting, sample selection, meta-learning, and combined approaches. We also present the commonly used experimental setup, data sets, and results of the state-of-the-art models.

128.Automatic Segmentation and Location Learning of Neonatal Cerebral Ventricles in 3D Ultrasound Data Combining CNN and CPPN ⬇️

Preterm neonates are highly likely to suffer from ventriculomegaly, a dilation of the Cerebral Ventricular System (CVS). This condition can develop into life-threatening hydrocephalus and is correlated with future neuro-developmental impairments. Consequently, it must be detected and monitored by physicians. In clinical routing, manual 2D measurements are performed on 2D ultrasound (US) images to estimate the CVS volume but this practice is imprecise due to the unavailability of 3D information. A way to tackle this problem would be to develop automatic CVS segmentation algorithms for 3D US data. In this paper, we investigate the potential of 2D and 3D Convolutional Neural Networks (CNN) to solve this complex task and propose to use Compositional Pattern Producing Network (CPPN) to enable the CNNs to learn CVS location. Our database was composed of 25 3D US volumes collected on 21 preterm nenonates at the age of $35.8 \pm 1.6$ gestational weeks. We found that the CPPN enables to encode CVS location, which increases the accuracy of the CNNs when they have few layers. Accuracy of the 2D and 3D CNNs reached intraobserver variability (IOV) in the case of dilated ventricles with Dice of $0.893 \pm 0.008$ and $0.886 \pm 0.004$ respectively (IOV = $0.898 \pm 0.008$) and with volume errors of $0.45 \pm 0.42$ cm$^3$ and $0.36 \pm 0.24$ cm$^3$ respectively (IOV = $0.41 \pm 0.05$ cm$^3$). 3D CNNs were more accurate than 2D CNNs in the case of normal ventricles with Dice of $0.797 \pm 0.041$ against $0.776 \pm 0.038$ (IOV = $0.816 \pm 0.009$) and volume errors of $0.35 \pm 0.29$ cm$^3$ against $0.35 \pm 0.24$ cm$^3$ (IOV = $0.2 \pm 0.11$ cm$^3$). The best segmentation time of volumes of size $320 \times 320 \times 320$ was obtained by a 2D CNN in $3.5 \pm 0.2$ s.

129.SpeakingFaces: A Large-Scale Multimodal Dataset of Voice Commands with Visual and Thermal Video Streams ⬇️

We present SpeakingFaces as a publicly-available large-scale dataset developed to support multimodal machine learning research in contexts that utilize a combination of thermal, visual, and audio data streams; examples include human-computer interaction (HCI), biometric authentication, recognition systems, domain transfer, and speech recognition. SpeakingFaces is comprised of well-aligned high-resolution thermal and visual spectra image streams of fully-framed faces synchronized with audio recordings of each subject speaking approximately 100 imperative phrases. Data was collected from 142 subjects, yielding over 13,000 instances of synchronized data (~3.8 TB). For technical validation, we demonstrate two baseline examples. The first baseline shows classification by gender, utilizing different combinations of the three data streams in both clean and noisy environments. The second example consists of thermal-to-visual facial image translation, as an instance of domain transfer.

130.iGibson, a Simulation Environment for Interactive Tasks in Large RealisticScenes ⬇️

We present iGibson, a novel simulation environment to develop robotic solutions for interactive tasks in large-scale realistic scenes. Our environment contains fifteen fully interactive home-sized scenes populated with rigid and articulated objects. The scenes are replicas of 3D scanned real-world homes, aligning the distribution of objects and layout to that of the real world. iGibson integrates several key features to facilitate the study of interactive tasks: i) generation of high-quality visual virtual sensor signals (RGB, depth, segmentation, LiDAR, flow, among others), ii) domain randomization to change the materials of the objects (both visual texture and dynamics) and/or their shapes, iii) integrated sampling-based motion planners to generate collision-free trajectories for robot bases and arms, and iv) intuitive human-iGibson interface that enables efficient collection of human demonstrations. Through experiments, we show that the full interactivity of the scenes enables agents to learn useful visual representations that accelerate the training of downstream manipulation tasks. We also show that iGibson features enable the generalization of navigation agents, and that the human-iGibson interface and integrated motion planners facilitate efficient imitation learning of simple human demonstrated behaviors. iGibson is open-sourced with comprehensive examples and documentation. For more information, visit our project website: this http URL

131.Cross-Modal Generalization: Learning in Low Resource Modalities via Meta-Alignment ⬇️

The natural world is abundant with concepts expressed via visual, acoustic, tactile, and linguistic modalities. Much of the existing progress in multimodal learning, however, focuses primarily on problems where the same set of modalities are present at train and test time, which makes learning in low-resource modalities particularly difficult. In this work, we propose algorithms for cross-modal generalization: a learning paradigm to train a model that can (1) quickly perform new tasks in a target modality (i.e. meta-learning) and (2) doing so while being trained on a different source modality. We study a key research question: how can we ensure generalization across modalities despite using separate encoders for different source and target modalities? Our solution is based on meta-alignment, a novel method to align representation spaces using strongly and weakly paired cross-modal data while ensuring quick generalization to new tasks across different modalities. We study this problem on 3 classification tasks: text to image, image to audio, and text to speech. Our results demonstrate strong performance even when the new target modality has only a few (1-10) labeled samples and in the presence of noisy labels, a scenario particularly prevalent in low-resource modalities.

132.Visually Grounded Compound PCFGs ⬇️

Exploiting visual groundings for language understanding has recently been drawing much attention. In this work, we study visually grounded grammar induction and learn a constituency parser from both unlabeled text and its visual groundings. Existing work on this task (Shi et al., 2019) optimizes a parser via Reinforce and derives the learning signal only from the alignment of images and sentences. While their model is relatively accurate overall, its error distribution is very uneven, with low performance on certain constituents types (e.g., 26.2% recall on verb phrases, VPs) and high on others (e.g., 79.6% recall on noun phrases, NPs). This is not surprising as the learning signal is likely insufficient for deriving all aspects of phrase-structure syntax and gradient estimates are noisy. We show that using an extension of probabilistic context-free grammar model we can do fully-differentiable end-to-end visually grounded learning. Additionally, this enables us to complement the image-text alignment loss with a language modeling objective. On the MSCOCO test captions, our model establishes a new state of the art, outperforming its non-grounded version and, thus, confirming the effectiveness of visual groundings in constituency grammar induction. It also substantially outperforms the previous grounded model, with largest improvements on more `abstract' categories (e.g., +55.1% recall on VPs).