ArXiv cs.CV --Mon, 2 Nov 2020

1.MichiGAN: Multi-Input-Conditioned Hair Image Generation for Portrait Editing

Despite the recent success of face image generation with GANs, conditional hair editing remains challenging due to the under-explored complexity of its geometry and appearance. In this paper, we present MichiGAN (Multi-Input-Conditioned Hair Image GAN), a novel conditional image generation method for interactive portrait hair manipulation. To provide user control over every major hair visual factor, we explicitly disentangle hair into four orthogonal attributes, including shape, structure, appearance, and background. For each of them, we design a corresponding condition module to represent, process, and convert user inputs, and modulate the image generation pipeline in ways that respect the natures of different visual attributes. All these condition modules are integrated with the backbone generator to form the final end-to-end network, which allows fully-conditioned hair generation from multiple user inputs. Upon it, we also build an interactive portrait hair editing system that enables straightforward manipulation of hair by projecting intuitive and high-level user inputs such as painted masks, guiding strokes, or reference photos to well-defined condition representations. Through extensive experiments and evaluations, we demonstrate the superiority of our method regarding both result quality and user controllability. The code is available at this https URL.

2.Unsupervised Monocular Depth Learning in Dynamic Scenes

We present a method for jointly training the estimation of depth, ego-motion, and a dense 3D translation field of objects relative to the scene, with monocular photometric consistency being the sole source of supervision. We show that this apparently heavily underdetermined problem can be regularized by imposing the following prior knowledge about 3D translation fields: they are sparse, since most of the scene is static, and they tend to be constant for rigid moving objects. We show that this regularization alone is sufficient to train monocular depth prediction models that exceed the accuracy achieved in prior work for dynamic scenes, including methods that require semantic input. Code is at this https URL.

3.What's in a Loss Function for Image Classification?

It is common to use the softmax cross-entropy loss to train neural networks on classification datasets where a single class label is assigned to each example. However, it has been shown that modifying softmax cross-entropy with label smoothing or regularizers such as dropout can lead to higher performance. This paper studies a variety of loss functions and output layer regularization strategies on image classification tasks. We observe meaningful differences in model predictions, accuracy, calibration, and out-of-distribution robustness for networks trained with different objectives. However, differences in hidden representations of networks trained with different objectives are restricted to the last few layers; representational similarity reveals no differences among network layers that are not close to the output. We show that all objectives that improve over vanilla softmax loss produce greater class separation in the penultimate layer of the network, which potentially accounts for improved performance on the original task, but results in features that transfer worse to other tasks.
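
As a minimal illustration of one of the modifications mentioned above, the sketch below computes softmax cross-entropy against a label-smoothed target. The function names and the uniform smoothing scheme are assumptions made for illustration; the abstract does not specify the exact formulations the paper studies.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def smoothed_cross_entropy(logits, target, smoothing=0.0):
    """Cross-entropy against a label-smoothed one-hot target.

    With smoothing=0.0 this is the plain softmax cross-entropy.
    (Hypothetical helper; uniform smoothing over all classes is assumed.)
    """
    k = len(logits)
    probs = softmax(logits)
    # The true class gets (1 - smoothing); smoothing / k is spread over all classes.
    targets = [
        smoothing / k + (1.0 - smoothing) * (1.0 if i == target else 0.0)
        for i in range(k)
    ]
    return -sum(t * math.log(p) for t, p in zip(targets, probs))


plain = smoothed_cross_entropy([10.0, 0.0], target=0, smoothing=0.0)
smooth = smoothed_cross_entropy([10.0, 0.0], target=0, smoothing=0.1)
print(plain, smooth)
```

For a very confident correct prediction, the smoothed loss is larger than the plain loss, which is how label smoothing discourages over-confident logits.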

4.Emotion Understanding in Videos Through Body, Context, and Visual-Semantic Embedding Loss

We present our winning submission to the First International Workshop on Bodily Expressed Emotion Understanding (BEEU) challenge. Based on recent literature on the effect of context/environment on emotion, as well as visual representations with semantic meaning using word embeddings, we extend the framework of Temporal Segment Network to accommodate these. Our method is verified on the validation set of the Body Language Dataset (BoLD) and achieves 0.26235 Emotion Recognition Score on the test set, surpassing the previous best result of 0.2530.

5.Automatic Counting and Identification of Train Wagons Based on Computer Vision and Deep Learning

In this work, we present a robust and efficient solution for counting and identifying train wagons using computer vision and deep learning. The proposed solution is cost-effective and can easily replace solutions based on radiofrequency identification (RFID), which are known to have high installation and maintenance costs. According to our experiments, our two-stage methodology achieves impressive results on real-world scenarios, i.e., 100% accuracy in the counting stage and 99.7% recognition rate in the identification one. Moreover, the system is able to automatically reject some of the train wagons successfully counted, as they have damaged identification codes. The results achieved were surprising considering that the proposed system requires low processing power (i.e., it can run in low-end setups) and that we used a relatively small number of images to train our Convolutional Neural Network (CNN) for character recognition. The proposed method is registered, under number BR512020000808-9, with the National Institute of Industrial Property (Brazil).

6.All-Weather Object Recognition Using Radar and Infrared Sensing

Autonomous cars are an emergent technology which has the capacity to change human lives. The current sensor systems which are most capable of perception are based on optical sensors. For example, deep neural networks show outstanding results in recognising objects when used to process data from cameras and Light Detection And Ranging (LiDAR) sensors. However, these sensors perform poorly under adverse weather conditions such as rain, fog, and snow due to the sensor wavelengths. This thesis explores new sensing developments based on long wave polarised infrared (IR) imagery and imaging radar to recognise objects. First, we developed a methodology based on Stokes parameters using polarised infrared data to recognise vehicles using deep neural networks. Second, we explored the potential of using only the power spectrum captured by low-THz radar sensors to perform object recognition in a controlled scenario. This latter work is based on a data-driven approach together with the development of a data augmentation method based on attenuation, range and speckle noise. Last, we created a new large-scale dataset in the "wild" with many different weather scenarios (sunny, overcast, night, fog, rain and snow) showing the robustness of radar for detecting vehicles in adverse weather. High resolution radar and polarised IR imagery, combined with a deep learning approach, are shown to be a potential alternative to current automotive sensing systems based on visible spectrum optical technology, as they are more robust in severe weather and adverse light conditions.
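
For readers unfamiliar with Stokes parameters, the sketch below computes the standard linear Stokes parameters from intensities measured through polarisers at 0, 45, 90 and 135 degrees, together with the degree and angle of linear polarisation. This is the textbook definition only; how the thesis feeds these quantities to a neural network is not stated in the abstract, and the function names are hypothetical.

```python
import math


def linear_stokes(i0, i45, i90, i135):
    """Linear Stokes parameters from four polariser-angle intensities."""
    s0 = i0 + i90      # total intensity
    s1 = i0 - i90      # horizontal vs. vertical preference
    s2 = i45 - i135    # diagonal preference
    return s0, s1, s2


def dolp_aolp(s0, s1, s2):
    """Degree (0..1) and angle (radians) of linear polarisation."""
    dolp = math.sqrt(s1 * s1 + s2 * s2) / s0 if s0 > 0 else 0.0
    aolp = 0.5 * math.atan2(s2, s1)
    return dolp, aolp


# Fully horizontally polarised light vs. unpolarised light (illustrative values).
print(dolp_aolp(*linear_stokes(1.0, 0.5, 0.0, 0.5)))
print(dolp_aolp(*linear_stokes(0.5, 0.5, 0.5, 0.5)))
```

In an imaging setting the same arithmetic would be applied per pixel.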

7.3D Object Recognition By Corresponding and Quantizing Neural 3D Scene Representations

We propose a system that learns to detect objects and infer their 3D poses in RGB-D images. Many existing systems can identify objects and infer 3D poses, but they heavily rely on human labels and 3D annotations. The challenge here is to achieve this without relying on strong supervision signals. To address this challenge, we propose a model that maps RGB-D images to a set of 3D visual feature maps in a differentiable fully-convolutional manner, supervised by predicting views. The 3D feature maps correspond to a featurization of the 3D world scene depicted in the images. The object 3D feature representations are invariant to camera viewpoint changes or zooms, which means feature matching can identify similar objects under different camera viewpoints. We can compare the 3D feature maps of two objects by searching alignment across scales and 3D rotations, and, as a result of the operation, we can estimate pose and scale changes without the need for 3D pose annotations. We cluster object feature maps into a set of 3D prototypes that represent familiar objects in canonical scales and orientations. We then parse images by inferring the prototype identity and 3D pose for each detected object. We compare our method to numerous baselines that do not learn 3D feature visual representations or do not attempt to correspond features across scenes, and outperform them by a large margin in the tasks of object retrieval and object pose estimation. Thanks to the 3D nature of the object-centric feature maps, the visual similarity cues are invariant to 3D pose changes or small scale changes, which gives our method an advantage over 2D and 1D methods.

8.Exploring Dynamic Context for Multi-path Trajectory Prediction

Accurately predicting the future positions of different agents in traffic scenarios is crucial for safely deploying intelligent autonomous systems in real-world environments. However, it remains a challenge because the behavior of a target agent is dynamically affected by other agents, and there is more than one socially possible path the agent could take. In this paper, we propose a novel framework, named Dynamic Context Encoder Network (DCENet). In our framework, first, the spatial context between agents is explored by using self-attention architectures. Then, two LSTM encoders are trained to learn temporal context between steps by taking the observed trajectories and the extracted dynamic spatial context as input, respectively. The spatial-temporal context is encoded into a latent space using a Conditional Variational Auto-Encoder (CVAE) module. Finally, a set of future trajectories for each agent is predicted conditioned on the learned spatial-temporal context by sampling repeatedly from the latent space. DCENet is evaluated on the largest and most challenging trajectory forecasting benchmark Trajnet and reports a new state-of-the-art performance. It also demonstrates superior performance evaluated on the benchmark InD for mixed traffic at intersections. A series of ablation studies are conducted to validate the effectiveness of each proposed module. Our code is available at

9.Experimental design for MRI by greedy policy search

In today's clinical practice, magnetic resonance imaging (MRI) is routinely accelerated through subsampling of the associated Fourier domain. Currently, the construction of these subsampling strategies - known as experimental design - relies primarily on heuristics. We propose to learn experimental design strategies for accelerated MRI with policy gradient methods. Unexpectedly, our experiments show that a simple greedy approximation of the objective leads to solutions nearly on-par with the more general non-greedy approach. We offer a partial explanation for this phenomenon rooted in greater variance in the non-greedy objective's gradient estimates, and experimentally verify that this variance hampers non-greedy models in adapting their policies to individual MR images. We empirically show that this adaptivity is key to improving subsampling designs.

10.HOI Analysis: Integrating and Decomposing Human-Object Interaction

Human-Object Interaction (HOI) consists of a human, an object, and an implicit interaction/verb. Unlike previous methods that directly map pixels to HOI semantics, we propose a novel perspective for HOI learning in an analytical manner. In analogy to Harmonic Analysis, whose goal is to study how to represent signals as a superposition of basic waves, we propose HOI Analysis. We argue that a coherent HOI can be decomposed into an isolated human and object. Meanwhile, the isolated human and object can also be integrated into a coherent HOI again. Moreover, transformations between human-object pairs with the same HOI can also be approached more easily with integration and decomposition. As a result, the implicit verb will be represented in the transformation function space. In light of this, we propose an Integration-Decomposition Network (IDN) to implement the above transformations and achieve state-of-the-art performance on widely-used HOI detection benchmarks. Code is available at this https URL.

11.Statistical Analysis of Signal-Dependent Noise: Application in Blind Localization of Image Splicing Forgery

Visual noise is often regarded as a disturbance in image quality, whereas it can also provide a crucial clue for image-based forensic tasks. Conventionally, noise is assumed to follow an additive Gaussian model, which is estimated and then used to reveal anomalies. However, real sensor noise should be modeled as signal-dependent noise (SDN). In this work, we apply SDN to splicing forgery localization tasks. Through statistical analysis of the SDN model, we assume that noise can be modeled as a Gaussian approximation for a certain brightness and propose a likelihood model for a noise level function. By building a maximum a posteriori Markov random field (MAP-MRF) framework, we exploit the likelihood of noise to reveal the alien region of spliced objects, with a probability combination refinement strategy. To ensure a completely blind detection, an iterative alternating method is adopted to estimate the MRF parameters. Experimental results demonstrate that our method is effective and provides a comparative localization performance.

12.End-to-end Animal Image Matting

Extracting accurate foreground animals from natural animal images benefits many downstream applications such as film production and augmented reality. However, the various appearance and furry characteristics of animals challenge existing matting methods, which usually require extra user inputs such as trimap or scribbles. To resolve these problems, we study the distinct roles of semantics and details for image matting and decompose the task into two parallel sub-tasks: high-level semantic segmentation and low-level details matting. Specifically, we propose a novel Glance and Focus Matting network (GFM), which employs a shared encoder and two separate decoders to learn both tasks in a collaborative manner for end-to-end animal image matting. Besides, we establish a novel Animal Matting dataset (AM-2k) containing 2,000 high-resolution natural animal images from 20 categories along with manually labeled alpha mattes. Furthermore, we investigate the domain gap issue between composite images and natural images systematically by conducting comprehensive analyses of various discrepancies between foreground and background images. We find that a carefully designed composition route RSSN that aims to reduce the discrepancies can lead to a better model with remarkable generalization ability. Comprehensive empirical studies on AM-2k demonstrate that GFM outperforms state-of-the-art methods and effectively reduces the generalization error.

13.Small Noisy and Perspective Face Detection using Deformable Symmetric Gabor Wavelet Network

Face detection and tracking in low-resolution images is not a trivial task because few appearance features are available for face characterization. Moreover, facial expressions add further distortion to these small and noisy faces. In this paper, we propose a deformable symmetric Gabor wavelet network face model for face detection in low-resolution images. Our model optimizes the rotation, translation, dilation, perspective and partial deformation amount of the face model with symmetry constraints. Symmetry constraints help our model to be more robust to noise and distortion. Experimental results on our low-resolution face image dataset and videos show promising face detection and tracking results under various challenging conditions.

14.Auto-Panoptic: Cooperative Multi-Component Architecture Search for Panoptic Segmentation

Panoptic segmentation is posed as a new popular test-bed for state-of-the-art holistic scene understanding methods, with the requirement of simultaneously segmenting both foreground things and background stuff. The state-of-the-art panoptic segmentation network exhibits high structural complexity in different network components, i.e. backbone, proposal-based foreground branch, segmentation-based background branch, and feature fusion module across branches, whose design heavily relies on expert knowledge and tedious trials. In this work, we propose an efficient, cooperative and highly automated framework to simultaneously search for all main components including backbone, segmentation branches, and feature fusion module in a unified panoptic segmentation pipeline based on the prevailing one-shot Network Architecture Search (NAS) paradigm. Notably, we extend the common single-task NAS into the multi-component scenario by taking advantage of the newly proposed intra-modular search space and problem-oriented inter-modular search space, which helps us to obtain an optimal network architecture that not only performs well in both instance segmentation and semantic segmentation tasks but is also aware of the reciprocal relations between foreground things and background stuff classes. To relieve the vast computation burden incurred by applying NAS to complicated network architectures, we present a novel path-priority greedy search policy to find a robust, transferable architecture with significantly reduced searching overhead. Our searched architecture, namely Auto-Panoptic, achieves the new state-of-the-art on the challenging COCO and ADE20K benchmarks. Moreover, extensive experiments are conducted to demonstrate the effectiveness of the path-priority policy and the transferability of Auto-Panoptic across different datasets. Codes and models are available at: this https URL.

15.PyraPose: Feature Pyramids for Fast and Accurate Object Pose Estimation under Domain Shift

Object pose estimation enables robots to understand and interact with their environments. Training with synthetic data is necessary in order to adapt to novel situations. Unfortunately, pose estimation under domain shift, i.e., training on synthetic data and testing in the real world, is challenging. Deep learning-based approaches currently perform best when using encoder-decoder networks but typically do not generalize to new scenarios with different scene characteristics. We argue that patch-based approaches, instead of encoder-decoder networks, are more suited for synthetic-to-real transfer because local to global object information is better represented. To that end, we present a novel approach based on a specialized feature pyramid network to compute multi-scale features for creating pose hypotheses on different feature map resolutions in parallel. Our single-shot pose estimation approach is evaluated on multiple standard datasets and outperforms the state of the art by up to 35%. We also perform grasping experiments in the real world to demonstrate the advantage of using synthetic data to generalize to novel environments.

16.An Unsupervised Approach towards Varying Human Skin Tone Using Generative Adversarial Networks

With the increasing popularity of augmented and virtual reality, retailers are now focusing more on customer satisfaction to increase sales. Although augmented reality is not a new concept, it has gained much-needed attention over the past few years. Our present work is targeted in this direction and may be used to enhance user experience in various virtual and augmented reality based applications. We propose a model to change the skin tone of a person. Given any input image of a person or a group of persons, with some value indicating the desired change of skin color towards fairness or darkness, this method can change the skin tone of the persons in the image. This is an unsupervised method and is also unconstrained in terms of pose, illumination, number of persons in the image, etc. The goal of this work is to reduce the time and effort generally required for changing the skin tone using existing applications (e.g., Photoshop) by professionals or novices. To establish the efficacy of this method, we have compared our results with those of some popular photo editors and also with the results of some existing benchmark methods related to human attribute manipulation. Rigorous experiments on different datasets show the effectiveness of this method in terms of synthesizing perceptually convincing outputs.

17.Correspondence Matrices are Underrated

Point-cloud registration (PCR) is an important task in various applications such as robotic manipulation, augmented and virtual reality, SLAM, etc. PCR is an optimization problem involving minimization over two different types of interdependent variables: transformation parameters and point-to-point correspondences. Recent developments in deep-learning have produced computationally fast approaches for PCR. The loss functions that are optimized in these networks are based on the error in the transformation parameters. We hypothesize that these methods would perform significantly better if they calculated their loss function using correspondence error instead of only using error in transformation parameters. We define correspondence error as a metric based on incorrectly matched point pairs. We provide a fundamental explanation for why this is the case and test our hypothesis by modifying existing methods to use correspondence-based loss instead of transformation-based loss. These experiments show that the modified networks converge faster and register more accurately even at larger misalignment when compared to the original networks.
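
To make the notion of correspondence error concrete, the sketch below counts the fraction of source points whose highest-scoring target in a correspondence matrix is not the true match. The argmax-based matching rule and the function name are assumptions for illustration; the abstract only says the metric is based on incorrectly matched point pairs.

```python
def correspondence_error(corr_matrix, true_match):
    """Fraction of source points matched to the wrong target point.

    corr_matrix[i][j] is a score for source point i matching target point j;
    true_match[i] is the index of the correct target for source point i.
    (Hypothetical helper: matching by row-wise argmax is an assumption.)
    """
    if len(corr_matrix) != len(true_match):
        raise ValueError("one ground-truth index is needed per source point")
    wrong = 0
    for row, truth in zip(corr_matrix, true_match):
        predicted = max(range(len(row)), key=row.__getitem__)
        if predicted != truth:
            wrong += 1
    return wrong / len(true_match)


scores = [
    [0.9, 0.1],  # source 0 prefers target 0 (correct)
    [0.6, 0.4],  # source 1 prefers target 0 (wrong, truth is 1)
]
print(correspondence_error(scores, [0, 1]))  # 0.5
```

A count like this is not differentiable, so a training loss would need a smooth surrogate over the same matrix, for example a row-wise cross-entropy against the true match.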

18.LIFI: Towards Linguistically Informed Frame Interpolation

In this work, we explore a new problem of frame interpolation for speech videos. Such content is today a major form of online communication. We try to solve this problem by using several deep learning video generation algorithms to generate the missing frames. We also provide examples where computer vision models, despite showing high performance on conventional non-linguistic metrics, fail to produce faithful interpolations of speech. With this motivation, we provide a new set of linguistically-informed metrics specifically targeted at the problem of speech video interpolation. We also release several datasets to test computer vision video generation models for their speech understanding.

19.Volumetric Medical Image Segmentation: A 3D Deep Coarse-to-fine Framework and Its Adversarial Examples

Although deep neural networks have been a dominant method for many 2D vision tasks, it is still challenging to apply them to 3D tasks, such as medical image segmentation, due to the limited amount of annotated 3D data and limited computational resources. In this chapter, by rethinking the strategy to apply 3D Convolutional Neural Networks to segment medical images, we propose a novel 3D-based coarse-to-fine framework to efficiently tackle these challenges. The proposed 3D-based framework outperforms its 2D counterparts by a large margin since it can leverage the rich spatial information along all three axes. We further analyze the threat of adversarial attacks on the proposed framework and show how to defend against the attack. We conduct experiments on three datasets, the NIH pancreas dataset, the JHMI pancreas dataset and the JHMI pathological cyst dataset, where the first two and the last one contain healthy and pathological pancreases respectively, and achieve the current state-of-the-art in terms of Dice-Sorensen Coefficient (DSC) on all of them. In particular, on the NIH pancreas segmentation dataset, we outperform the previous best by an average of over 2%, and the worst case is improved by 7% to reach almost 70%, which indicates the reliability of our framework in clinical applications.
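
The Dice-Sorensen Coefficient used above measures the overlap between a predicted mask A and a ground-truth mask B as 2|A∩B| / (|A| + |B|). A minimal sketch over flattened binary masks follows; the function name and the convention of returning 1.0 for two empty masks are assumptions, not details taken from the chapter.

```python
def dice_coefficient(pred, truth):
    """Dice-Sorensen Coefficient between two flattened binary masks."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


pred = [1, 1, 0, 0]
truth = [1, 0, 1, 0]
print(dice_coefficient(pred, truth))  # 0.5
```

A 3D volume can be scored the same way after flattening it into a single sequence of voxels.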

20.CNN based Multistage Gated Average Fusion (MGAF) for Human Action Recognition Using Depth and Inertial Sensors

A Convolutional Neural Network (CNN) makes it possible to extract and fuse features from all layers of its architecture. However, extracting and fusing intermediate features from different layers of a CNN is still uninvestigated for Human Action Recognition (HAR) using depth and inertial sensors. To get the maximum benefit from accessing all of the CNN's layers, in this paper we propose a novel Multistage Gated Average Fusion (MGAF) network which extracts and fuses features from all layers of the CNN using our novel and computationally efficient Gated Average Fusion (GAF) network, a decisive integral element of MGAF. At the input of the proposed MGAF, we transform the depth and inertial sensor data into depth images called sequential front view images (SFI) and signal images (SI) respectively. These SFI are formed from the front view information generated by depth data. A CNN is employed to extract feature maps from both input modalities. The GAF network fuses the extracted features effectively while also preserving the dimensionality of the fused features. The proposed MGAF network has structural extensibility and can be extended to more than two modalities. Experiments on three publicly available multimodal HAR datasets demonstrate that the proposed MGAF outperforms the previous state-of-the-art fusion methods for depth-inertial HAR in terms of recognition accuracy while being computationally much more efficient. We increase the accuracy by an average of 1.5 percent while reducing the computational cost by approximately 50 percent over the previous state of the art.

21.SMOT: Single-Shot Multi Object Tracking

We present the single-shot multi-object tracker (SMOT), a new tracking framework that converts any single-shot detector (SSD) model into an online multiple object tracker, which emphasizes simultaneously detecting and tracking the object paths. In contrast to existing tracking-by-detection approaches, which suffer from errors made by the object detectors, SMOT adopts the recently proposed scheme of tracking by re-detection. We combine this scheme with SSD detectors by proposing a novel tracking anchor assignment module. With this design, SMOT is able to generate tracklets with a constant per-frame runtime. A lightweight linkage algorithm is then used for online tracklet linking. On three object tracking benchmarks, Hannah, Music Videos, and MOT17, the proposed SMOT achieves state-of-the-art performance.

22.Loss-rescaling VQA: Revisiting Language Prior Problem from a Class-imbalance View

Recent studies have pointed out that many well-developed Visual Question Answering (VQA) models are heavily affected by the language prior problem, which refers to making predictions based on the co-occurrence pattern between textual questions and answers instead of reasoning over visual content. To tackle it, most existing methods focus on enhancing visual feature learning to reduce the influence of this superficial textual shortcut on VQA model decisions. However, limited effort has been devoted to providing an explicit interpretation of its inherent cause. The research community thus lacks good guidance for moving forward in a purposeful way, which leads to confusion about how to construct models that overcome this non-trivial problem. In this paper, we propose to interpret the language prior problem in VQA from a class-imbalance view. Concretely, we design a novel interpretation scheme whereby the loss of mis-predicted frequent and sparse answers of the same question type is distinctly exhibited during the late training phase. It explicitly reveals why the VQA model tends to produce a frequent yet obviously wrong answer to a given question whose right answer is sparse in the training set. Based upon this observation, we further develop a novel loss re-scaling approach to assign different weights to each answer based on the training data statistics for computing the final loss. We apply our approach to three baselines, and the experimental results on two VQA-CP benchmark datasets clearly demonstrate its effectiveness. In addition, we also justify the validity of the class imbalance interpretation scheme on other computer vision tasks, such as face recognition and image classification.
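
The idea of weighting each answer by training-set statistics can be sketched as follows. The abstract does not give the paper's weighting formula, so this example substitutes plain inverse-frequency weighting within each question type; the function name, the normalisation, and the toy data are all assumptions.

```python
from collections import Counter, defaultdict


def answer_weights(samples):
    """Per-answer loss weights from (question_type, answer) training pairs.

    Inverse-frequency weighting within each question type is assumed here as
    a stand-in; it is not the paper's stated formula.
    """
    counts = defaultdict(Counter)
    for qtype, answer in samples:
        counts[qtype][answer] += 1

    weights = {}
    for qtype, counter in counts.items():
        total = sum(counter.values())
        n_answers = len(counter)
        for answer, count in counter.items():
            # Normalised so that a uniform answer distribution yields weight 1.
            weights[(qtype, answer)] = total / (n_answers * count)
    return weights


# Toy training set: "white" is frequent and "red" is sparse for colour questions.
train = [("color", "white")] * 3 + [("color", "red")]
w = answer_weights(train)
print(w[("color", "white")], w[("color", "red")])
```

Multiplying each sample's loss by the weight of its ground-truth answer gives sparse answers a larger contribution to the final loss than frequent ones.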

23.Can the state of relevant neurons in a deep neural networks serve as indicators for detecting adversarial attacks?

We present a method for adversarial attack detection based on the inspection of a sparse set of neurons. We follow the hypothesis that adversarial attacks introduce imperceptible perturbations in the input and that these perturbations change the state of neurons relevant for the concepts modelled by the attacked model. Therefore, monitoring the status of these neurons would enable the detection of adversarial attacks. Focusing on the image classification task, our method identifies neurons that are relevant for the classes predicted by the model. A deeper qualitative inspection of this sparse set of neurons indicates that their state changes in the presence of adversarial samples. Moreover, quantitative results from our empirical evaluation indicate that our method is capable of recognizing adversarial samples, produced by state-of-the-art attack methods, with comparable accuracy to that of state-of-the-art detectors.

24.PAL : Pretext-based Active Learning

When obtaining labels is expensive, the requirement of a large labeled training data set for deep learning can be mitigated by active learning. Active learning refers to the development of algorithms to judiciously pick limited subsets of unlabeled samples that can be sent for labeling by an oracle. We propose an intuitive active learning technique that, in addition to the task neural network (e.g., for classification), uses an auxiliary self-supervised neural network that assesses the utility of an unlabeled sample for inclusion in the labeled set. Our core idea is that the difficulty the auxiliary network, trained on labeled samples, has in solving a self-supervision task on an unlabeled sample represents the utility of obtaining the label of that unlabeled sample. Specifically, we assume that an unlabeled image on which the precision of predicting a randomly applied geometric transform is low must be out of the distribution represented by the current set of labeled images. These images will therefore maximize the relative information gain when labeled by the oracle. We also demonstrate that augmenting the auxiliary network with task-specific training further improves the results. We demonstrate strong performance on a range of widely used datasets and establish a new state of the art for active learning. We also make our code publicly available to encourage further research.

25.Detecting small polyps using a Dynamic SSD-GAN

Endoscopic examinations are used to inspect the throat, stomach and bowel for polyps which could develop into cancer. Machine learning systems can be trained to process colonoscopy images and detect polyps. However, these systems tend to perform poorly on objects which appear visually small in the images. It is shown here that combining the single-shot detector as a region proposal network with an adversarially-trained generator to upsample small region proposals can significantly improve the detection of visually-small polyps. The Dynamic SSD-GAN pipeline introduced in this paper achieved a 12% increase in sensitivity on visually-small polyps compared to a conventional FCN baseline.

26.A Comprehensive Comparison of End-to-End Approaches for Handwritten Digit String Recognition

Over the last decades, most approaches proposed for handwritten digit string recognition (HDSR) have resorted to digit segmentation, which is dominated by heuristics, thereby imposing substantial constraints on the final performance. Few of them have been based on segmentation-free strategies where each pixel column has a potential cut location. Recently, segmentation-free strategies have added another perspective to the problem, leading to promising results. However, these strategies still show some limitations when dealing with a large number of touching digits. To bridge the resulting gap, in this paper, we hypothesize that a string of digits can be approached as a sequence of objects. We thus evaluate different end-to-end approaches to solve the HDSR problem, particularly in two verticals: those based on object-detection (e.g., Yolo and RetinaNet) and those based on sequence-to-sequence representation (CRNN). The main contribution of this work lies in its provision of a comprehensive comparison with a critical analysis of the above-mentioned strategies on five benchmarks commonly used to assess HDSR, including the challenging Touching Pair dataset, NIST SD19, and two real-world datasets (CAR and CVL) proposed for the ICFHR 2014 competition on HDSR. Our results show that the Yolo model compares favorably against segmentation-free models with the advantage of having a shorter pipeline that minimizes the presence of heuristics-based models. It achieved a 97%, 96%, and 84% recognition rate on the NIST-SD19, CAR, and CVL datasets, respectively.

27.Perception Matters: Exploring Imperceptible and Transferable Anti-forensics for GAN-generated Fake Face Imagery Detection ⬇️

Generative adversarial networks (GANs) can now generate photo-realistic fake facial images which are perceptually indistinguishable from real face photos, promoting research on fake face detection. Though fake face forensics can achieve high detection accuracy, their anti-forensic counterparts are less investigated. Here we explore more imperceptible and transferable anti-forensics for fake face imagery detection based on adversarial attacks. Since facial and background regions are often smooth, even a small perturbation could cause noticeable perceptual impairment in fake face images. This makes existing adversarial attacks ineffective as an anti-forensic method. Our perturbation analysis reveals the intuitive reason for the perceptual degradation issue when directly applying existing attacks. We then propose a novel adversarial attack method, better suited for image anti-forensics, in the transformed color domain by considering visual perception. Simple yet effective, the proposed method can fool both deep learning and non-deep learning based forensic detectors, achieving a higher attack success rate and significantly improved visual quality. Specifically, when adversaries consider imperceptibility as a constraint, the proposed anti-forensic method can improve the average attack success rate by around 30% on fake face images over two baseline attacks. More imperceptible and more transferable, the proposed method raises new security concerns for fake face imagery detection. We have released our code for public use, and hopefully the proposed method can be further explored in related forensic applications as an anti-forensic benchmark.

28.Development and Evaluation of a Deep Neural Network for Histologic Classification of Renal Cell Carcinoma on Biopsy and Surgical Resection Slides ⬇️

Renal cell carcinoma (RCC) is the most common renal cancer in adults. The histopathologic classification of RCC is essential for diagnosis, prognosis, and management of patients. Reorganization and classification of complex histologic patterns of RCC on biopsy and surgical resection slides under a microscope remains a heavily specialized, error-prone, and time-consuming task for pathologists. In this study, we developed a deep neural network model that can accurately classify digitized surgical resection slides and biopsy slides into five related classes: clear cell RCC, papillary RCC, chromophobe RCC, renal oncocytoma, and normal. In addition to the whole-slide classification pipeline, we visualized the identified indicative regions and features on slides for classification by reprocessing patch-level classification results to ensure the explainability of our diagnostic model. We evaluated our model on independent test sets of 78 surgical resection whole slides and 79 biopsy slides from our tertiary medical institution, and 69 randomly selected surgical resection slides from The Cancer Genome Atlas (TCGA) database. The average area under the curve (AUC) of our classifier on the internal resection slides, internal biopsy slides, and external TCGA slides is 0.98, 0.98, and 0.99, respectively. Our results suggest the high generalizability of our approach across different data sources and specimen types. More importantly, our model has the potential to assist pathologists by (1) automatically pre-screening slides to reduce false-negative cases, (2) highlighting regions of importance on digitized slides to accelerate diagnosis, and (3) providing objective and accurate diagnosis as a second opinion.

29.Domain-Specific Lexical Grounding in Noisy Visual-Textual Documents ⬇️

Images can give us insights into the contextual meanings of words, but current image-text grounding approaches require detailed annotations. Such granular annotation is rare, expensive, and unavailable in most domain-specific contexts. In contrast, unlabeled multi-image, multi-sentence documents are abundant. Can lexical grounding be learned from such documents, even though they have significant lexical and visual overlap? Working with a case study dataset of real estate listings, we demonstrate the challenge of distinguishing highly correlated grounded terms, such as "kitchen" and "bedroom", and introduce metrics to assess this document similarity. We present a simple unsupervised clustering-based method that increases precision and recall beyond object detection and image tagging baselines when evaluated on labeled subsets of the dataset. The proposed method is particularly effective for local contextual meanings of a word, for example associating "granite" with countertops in the real estate dataset and with rocky landscapes in a Wikipedia dataset.

30.DeepWay: a Deep Learning Estimator for Unmanned Ground Vehicle Global Path Planning ⬇️

Agriculture 3.0 and 4.0 have gradually introduced service robotics and automation into several agricultural processes, mostly improving crop quality and seasonal yield. Row-based crops are the perfect settings to test and deploy smart machines capable of monitoring and managing the harvest. In this context, global path planning is essential for both ground and aerial vehicles, and it is the starting point for every type of mission plan. Nevertheless, the research community has so far given little attention to this problem, and global path planning automation is still far from being solved. In order to generate a viable path for an autonomous machine, the presented research proposes a feature learning fully convolutional model capable of estimating waypoints given an occupancy grid map. In particular, we apply the proposed data-driven methodology to the specific case of row-based crops with the general objective of generating a global path able to cover the extension of the crop completely. Extensive experimentation with a custom-made synthetic dataset and real satellite-derived images of different scenarios has proved the effectiveness of our methodology and demonstrated the feasibility of an end-to-end and completely autonomous global path planner.

31.Learning Vision-based Reactive Policies for Obstacle Avoidance ⬇️

In this paper, we address the problem of vision-based obstacle avoidance for robotic manipulators. This topic poses challenges for both perception and motion generation. While most work in the field aims at improving one of those aspects, we provide a unified framework for approaching this problem. The main goal of this framework is to connect perception and motion by identifying the relationship between the visual input and the corresponding motion representation. To this end, we propose a method for learning reactive obstacle avoidance policies. We evaluate our method on goal-reaching tasks in single- and multiple-obstacle scenarios. We show the ability of the proposed method to efficiently learn stable obstacle avoidance strategies at a high success rate, while maintaining the closed-loop responsiveness required for critical applications like human-robot interaction.

32.Automatic Myocardial Infarction Evaluation from Delayed-Enhancement Cardiac MRI using Deep Convolutional Networks ⬇️

In this paper, we propose a new deep learning framework for automatic myocardial infarction evaluation from clinical information and delayed-enhancement MRI (DE-MRI). The proposed framework addresses two tasks. The first task is automatic detection of myocardial contours, the infarcted area, the no-reflow area, and the left ventricular cavity from a short-axis DE-MRI series. It employs two segmentation neural networks. The first network is used to segment the anatomical structures such as the myocardium and left ventricular cavity. The second network is used to segment the pathological areas such as myocardial infarction, myocardial no-reflow, and normal myocardial region. The segmented myocardium region from the first network is further used to refine the second network's pathological segmentation results. The second task is to automatically classify a given case into normal or pathological from clinical information with or without DE-MRI. A cascaded support vector machine (SVM) is employed to classify a given case from its associated clinical information. The segmented pathological areas from DE-MRI are also used for the classification task. We evaluated our method on the 2020 EMIDEC MICCAI challenge dataset. It yielded an average Dice index of 0.93 and 0.84, respectively, for the left ventricular cavity and the myocardium. Classification using only clinical information yielded 80% accuracy over five-fold cross-validation. Using the DE-MRI, our method can classify the cases with 93.3% accuracy. These experimental results reveal that the proposed method can automatically evaluate myocardial infarction.
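
The Dice index reported above measures the overlap between a predicted and a reference segmentation mask. As a minimal sketch (the function name `dice_index` and the flat 0/1 mask representation are assumptions made for illustration, not the authors' evaluation code), it can be computed as twice the intersection divided by the sum of the two mask sizes:

```python
def dice_index(pred, target):
    """Dice overlap between two binary masks given as flat sequences of 0/1.

    Hypothetical helper for illustration; not taken from the paper.
    """
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    # Pixels labelled foreground in both masks.
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    # Total foreground pixels across both masks.
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        return 1.0  # convention assumed here: two empty masks agree perfectly
    return 2.0 * intersection / total


# Toy masks: one shared foreground pixel, three foreground pixels overall.
score = dice_index([1, 1, 0, 0], [1, 0, 0, 0])
```

A value of 1.0 means the masks coincide and 0.0 means they share no foreground pixel; in the toy call the score is 2 * 1 / 3.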

33.Fusion-Catalyzed Pruning for Optimizing Deep Learning on Intelligent Edge Devices ⬇️

The increasing computational cost of deep neural network models limits the applicability of intelligent applications on resource-constrained edge devices. While a number of neural network pruning methods have been proposed to compress the models, prevailing approaches focus only on parametric operators (e.g., convolution), which may miss optimization opportunities. In this paper, we present a novel fusion-catalyzed pruning approach, called FuPruner, which simultaneously optimizes the parametric and non-parametric operators for accelerating neural networks. We introduce an aggressive fusion method to equivalently transform a model, which extends the optimization space of pruning and enables non-parametric operators to be pruned in a similar manner as parametric operators, and a dynamic filter pruning method is applied to decrease the computational cost of models while retaining the accuracy requirement. Moreover, FuPruner provides configurable optimization options for controlling fusion and pruning, allowing much more flexible performance-accuracy trade-offs to be made. Evaluation with state-of-the-art residual neural networks on five representative intelligent edge platforms, Jetson TX2, Jetson Nano, Edge TPU, NCS, and NCS2, demonstrates the effectiveness of our approach, which can accelerate the inference of models on CIFAR-10 and ImageNet datasets.

34.Bayesian Optimization Meets Laplace Approximation for Robotic Introspection ⬇️

In robotics, deep learning (DL) methods are used more and more widely, but their general inability to provide reliable confidence estimates will ultimately lead to fragile and unreliable systems. This impedes the potential deployments of DL methods for long-term autonomy. Therefore, in this paper we introduce a scalable Laplace Approximation (LA) technique to make Deep Neural Networks (DNNs) more introspective, i.e. to enable them to provide accurate assessments of their failure probability for unseen test data. In particular, we propose a novel Bayesian Optimization (BO) algorithm to mitigate their tendency of under-fitting the true weight posterior, so that both the calibration and the accuracy of the predictions can be simultaneously optimized. We demonstrate empirically that the proposed BO approach requires fewer iterations for this when compared to random search, and we show that the proposed framework can be scaled up to large datasets and architectures.

35.Classifying Malware Images with Convolutional Neural Network Models ⬇️

Due to increasing threats from malicious software (malware) in both number and complexity, researchers have developed approaches to automatic detection and classification of malware, instead of manually analyzing malware files in a time-consuming effort. At the same time, malware authors have developed techniques to evade the signature-based detection used by antivirus companies. Most recently, deep learning has been used in malware classification to solve this issue. In this paper, we use several convolutional neural network (CNN) models for static malware classification. In particular, we use six deep learning models, three of which are past winners of the ImageNet Large-Scale Visual Recognition Challenge. The other three models are CNN-SVM, GRU-SVM and MLP-SVM, which enhance neural models with support vector machines (SVM). We perform experiments using the Malimg dataset, which has malware images that were converted from Portable Executable malware binaries. The dataset is divided into 25 malware families. Comparisons show that the Inception V3 model achieves a test accuracy of 99.24%, which is better than the accuracy of 98.52% achieved by the current state-of-the-art system called the M-CNN model.

36.CT-CAPS: Feature Extraction-based Automated Framework for COVID-19 Disease Identification from Chest CT Scans using Capsule Networks ⬇️

The global outbreak of the novel coronavirus (COVID-19) disease has drastically impacted the world and led to one of the most challenging crises across the globe since World War II. The early diagnosis and isolation of COVID-19 positive cases are considered crucial steps towards preventing the spread of the disease and flattening the epidemic curve. Chest Computed Tomography (CT) scan is a highly sensitive, rapid, and accurate diagnostic technique that can complement the Reverse Transcription Polymerase Chain Reaction (RT-PCR) test. Recently, deep learning-based models, mostly based on Convolutional Neural Networks (CNN), have shown promising diagnostic results. CNNs, however, are incapable of capturing spatial relations between image instances and require large datasets. Capsule Networks, on the other hand, can capture spatial relations, require smaller datasets, and have considerably fewer parameters. In this paper, a Capsule network framework, referred to as the "CT-CAPS", is presented to automatically extract distinctive features of chest CT scans. These features, which are extracted from the layer before the final capsule layer, are then leveraged to differentiate COVID-19 from Non-COVID cases. The experiments on our in-house dataset of 307 patients show state-of-the-art performance with an accuracy of 90.8%, a sensitivity of 94.5%, and a specificity of 86.0%.
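
Accuracy, sensitivity, and specificity, as reported above, all derive from the four cells of a binary confusion matrix. The sketch below shows the standard definitions; the function name `binary_metrics` and the counts in the example call are hypothetical and are not taken from the paper's data.

```python
def binary_metrics(tp, fp, tn, fn):
    """Return (accuracy, sensitivity, specificity) from confusion-matrix counts.

    tp/fn count positive cases (e.g. COVID-19) that were detected/missed;
    tn/fp count negative cases that were correctly cleared/falsely flagged.
    """
    positives = tp + fn
    negatives = tn + fp
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative case")
    accuracy = (tp + tn) / (positives + negatives)
    sensitivity = tp / positives   # true-positive rate (recall)
    specificity = tn / negatives   # true-negative rate
    return accuracy, sensitivity, specificity


# Hypothetical counts chosen only to illustrate the arithmetic.
acc, sens, spec = binary_metrics(tp=90, fp=20, tn=80, fn=10)
```

Because sensitivity and specificity are computed per class, a model can score well on one and poorly on the other, which is why abstracts such as this one report all three figures.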

37.COVID-FACT: A Fully-Automated Capsule Network-based Framework for Identification of COVID-19 Cases from Chest CT scans ⬇️

The newly discovered Coronavirus Disease 2019 (COVID-19) has been spreading globally and has caused hundreds of thousands of deaths around the world since its first emergence in late 2019. Computed tomography (CT) scans have shown distinctive features and higher sensitivity compared to other diagnostic tests, in particular the current gold standard, i.e., the Reverse Transcription Polymerase Chain Reaction (RT-PCR) test. Current deep learning-based algorithms are mainly developed based on Convolutional Neural Networks (CNNs) to identify COVID-19 pneumonia cases. CNNs, however, require extensive data augmentation and large datasets to identify detailed spatial relations between image instances. Furthermore, existing algorithms utilizing CT scans either extend slice-level predictions to patient-level ones using a simple thresholding mechanism or rely on a sophisticated infection segmentation to identify the disease. In this paper, we propose a two-stage fully-automated CT-based framework for identification of COVID-19 positive cases referred to as the "COVID-FACT". COVID-FACT utilizes Capsule Networks as its main building blocks and is, therefore, capable of capturing spatial information. In particular, to make the proposed COVID-FACT independent from sophisticated segmentation of the area of infection, slices demonstrating infection are detected at the first stage and the second stage is responsible for classifying patients into COVID and non-COVID cases. COVID-FACT detects slices with infection, and identifies positive COVID-19 cases using an in-house CT scan dataset, containing COVID-19, community acquired pneumonia, and normal cases. Based on our experiments, COVID-FACT achieves an accuracy of 90.82%, a sensitivity of 94.55%, a specificity of 86.04%, and an Area Under the Curve (AUC) of 0.98, while depending on far less supervision and annotation, in comparison to its counterparts.

38.FLANNEL: Focal Loss Based Neural Network Ensemble for COVID-19 Detection ⬇️

We test the possibility of differentiating chest x-ray images of COVID-19 patients from those of other pneumonia and healthy patients using deep neural networks. We construct the x-ray imaging data from two publicly available sources, which include 5508 chest x-ray images across 2874 patients with four classes: normal, bacterial pneumonia, non-COVID-19 viral pneumonia, and COVID-19. To identify COVID-19, we propose a Focal Loss Based Neural Ensemble Network (FLANNEL), a flexible module that ensembles several convolutional neural network (CNN) models and fuses them with a focal loss for accurate COVID-19 detection on class-imbalanced data. FLANNEL consistently outperforms baseline models on the COVID-19 identification task in all metrics. Compared with the best baseline, FLANNEL shows a higher macro-F1 score, with a 6% relative increase on the COVID-19 identification task, where it achieves 0.7833 (0.07) in precision, 0.8609 (0.03) in recall, and 0.8168 (0.03) in F1 score.
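
The focal loss named above down-weights examples the model already classifies confidently, which helps when one class (here COVID-19) is rare. The sketch below uses the commonly cited formulation FL(p_t) = -alpha * (1 - p_t)^gamma * log(p_t); the abstract does not specify which variant or hyperparameters FLANNEL uses, so the function name `focal_loss` and the defaults `gamma=2.0` and `alpha=1.0` are assumptions.

```python
import math


def focal_loss(probs, target, gamma=2.0, alpha=1.0):
    """Focal loss for one example, given predicted class probabilities.

    probs  -- sequence of class probabilities summing to 1
    target -- index of the true class
    gamma  -- focusing parameter; gamma=0 reduces to cross-entropy
    alpha  -- class weighting factor (assumed constant here)
    """
    # Probability assigned to the true class, clamped to avoid log(0).
    p_t = min(max(probs[target], 1e-12), 1.0)
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)


# A confidently correct prediction contributes far less than under cross-entropy.
easy = focal_loss([0.1, 0.9], target=1)
hard = focal_loss([0.9, 0.1], target=1)
```

With `gamma=0.0` the expression is plain cross-entropy, so the focusing parameter can be read as a dial between standard training and training that concentrates on hard, often minority-class, examples.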

39.PIINET: A 360-degree Panoramic Image Inpainting Network Using a Cube Map ⬇️

Inpainting has been continuously studied in the field of computer vision. As artificial intelligence technology developed, deep learning technology was introduced in inpainting research, helping to improve performance. So far, the input targets of deep learning-based inpainting algorithms have ranged from single images to videos. However, deep learning-based inpainting technology for panoramic images has not been actively studied. We propose a 360-degree panoramic image inpainting method using generative adversarial networks (GANs). The proposed network takes a 360-degree equirectangular-format panoramic image as input, converts it into a cube map format, which has relatively little distortion, and uses the result to train the network. Since the cube map format is used, the correlation of the six sides of the cube map should be considered. Therefore, all faces of the cube map are used as input for the whole discriminative network, and each face of the cube map is used as input for the slice discriminative network to determine the authenticity of the generated image. The proposed network performed qualitatively better than existing single-image inpainting algorithms and baseline algorithms.

40.AutoAtlas: Neural Network for 3D Unsupervised Partitioning and Representation Learning ⬇️

We present a novel neural network architecture called AutoAtlas for fully unsupervised partitioning and representation learning of 3D brain Magnetic Resonance Imaging (MRI) volumes. AutoAtlas consists of two neural network components: one that performs multi-label partitioning based on local texture in the volume and a second that compresses the information contained within each partition. We train both of these components simultaneously by optimizing a loss function that is designed to promote accurate reconstruction of each partition, while encouraging spatially smooth and contiguous partitioning, and discouraging relatively small partitions. We show that the partitions adapt to the subject specific structural variations of brain tissue while consistently appearing at similar spatial locations across subjects. AutoAtlas also produces very low dimensional features that represent local texture of each partition. We demonstrate prediction of metadata associated with each subject using the derived feature representations and compare the results to prediction using features derived from FreeSurfer anatomical parcellation. Since our features are intrinsically linked to distinct partitions, we can then map values of interest, such as partition-specific feature importance scores onto the brain for visualization.

41.Human versus Machine Attention in Deep Reinforcement Learning Tasks ⬇️

Deep reinforcement learning (RL) algorithms are powerful tools for solving visuomotor decision tasks. However, the trained models are often difficult to interpret, because they are represented as end-to-end deep neural networks. In this paper, we shed light on the inner workings of such trained models by analyzing the pixels that they attend to during task execution, and comparing them with the pixels attended to by humans executing the same tasks. To this end, we investigate the following two questions that, to the best of our knowledge, have not been previously studied. 1) How similar are the visual features learned by RL agents and humans when performing the same task? and, 2) How do similarities and differences in these learned features correlate with RL agents' performance on these tasks? Specifically, we compare the saliency maps of RL agents against visual attention models of human experts when learning to play Atari games. Further, we analyze how hyperparameters of the deep RL algorithm affect the learned features and saliency maps of the trained agents. The insights provided by our results have the potential to inform novel algorithms for the purpose of closing the performance gap between human experts and deep RL agents.

42.Multi-agent Trajectory Prediction with Fuzzy Query Attention ⬇️

Trajectory prediction for scenes with multiple agents and entities is a challenging problem in numerous domains such as traffic prediction, pedestrian tracking and path planning. We present a general architecture to address this challenge which models the crucial inductive biases of motion, namely, inertia, relative motion, intents and interactions. Specifically, we propose a relational model to flexibly model interactions between agents in diverse environments. Since it is well-known that human decision making is fuzzy by nature, at the core of our model lies a novel attention mechanism which models interactions by making continuous-valued (fuzzy) decisions and learning the corresponding responses. Our architecture demonstrates significant performance gains over existing state-of-the-art predictive models in diverse domains such as human crowd trajectories, US freeway traffic, NBA sports data and physics datasets. We also present ablations and augmentations to understand the decision-making process and the source of gains in our model.

43.Ink Marker Segmentation in Histopathology Images Using Deep Learning ⬇️

Due to the recent advancements in machine vision, digital pathology has gained significant attention. Histopathology images are distinctly rich in visual information. The tissue glass slide images are utilized for disease diagnosis. Researchers study many methods to process histopathology images and facilitate fast and reliable diagnosis; therefore, the availability of high-quality slides becomes paramount. The quality of the images can be negatively affected when the glass slides are ink-marked by pathologists to delineate regions of interest. As an example, in one of the largest public histopathology datasets, The Cancer Genome Atlas (TCGA), approximately 12% of the digitized slides are affected by manual delineations through ink markings. To process these open-access slide images and other repositories for the design and validation of new methods, an algorithm to detect the marked regions of the images is essential to avoid confusing tissue pixels with ink-colored pixels for computer methods. In this study, we propose to segment the ink-marked areas of pathology patches through a deep network. A dataset from 79 whole slide images with 4,305 patches was created and different networks were trained. Finally, the results showed that an FPN model with EfficientNet-B3 as the backbone was the superior configuration, with an F1 score of 94.53%.