Skip to content

Latest commit

 

History

History
111 lines (111 loc) · 72.7 KB

20200731.md

File metadata and controls

111 lines (111 loc) · 72.7 KB

ArXiv cs.CV --Fri, 31 Jul 2020

1.Contrastive Learning for Unpaired Image-to-Image Translation ⬇️

In image-to-image translation, each patch in the output should reflect the content of the corresponding patch in the input, independent of domain. We propose a straightforward method for doing so -- maximizing mutual information between the two, using a framework based on contrastive learning. The method encourages two elements (corresponding patches) to map to a similar point in a learned feature space, relative to other elements (other patches) in the dataset, referred to as negatives. We explore several critical design choices for making contrastive learning effective in the image synthesis setting. Notably, we use a multilayer, patch-based approach, rather than operate on entire images. Furthermore, we draw negatives from within the input image itself, rather than from the rest of the dataset. We demonstrate that our framework enables one-sided translation in the unpaired image-to-image translation setting, while improving quality and reducing training time. In addition, our method can even be extended to the training setting where each "domain" is only a single image.

2.Perceiving 3D Human-Object Spatial Arrangements from a Single Image in the Wild ⬇️

We present a method that infers spatial arrangements and shapes of humans and objects in a globally consistent 3D scene, all from a single image in-the-wild captured in an uncontrolled environment. Notably, our method runs on datasets without any scene- or object-level 3D supervision. Our key insight is that considering humans and objects jointly gives rise to "3D common sense" constraints that can be used to resolve ambiguity. In particular, we introduce a scale loss that learns the distribution of object size from data; an occlusion-aware silhouette re-projection loss to optimize object pose; and a human-object interaction loss to capture the spatial layout of objects with which humans interact. We empirically validate that our constraints dramatically reduce the space of likely 3D spatial configurations. We demonstrate our approach on challenging, in-the-wild images of humans interacting with large objects (such as bicycles, motorcycles, and surfboards) and handheld objects (such as laptops, tennis rackets, and skateboards). We quantify the ability of our approach to recover human-object arrangements and outline remaining challenges in this relatively domain. The project webpage can be found at this https URL.

3.Rewriting a Deep Generative Model ⬇️

A deep generative model such as a GAN learns to model a rich set of semantic and physical rules about the target distribution, but up to now, it has been obscure how such rules are encoded in the network, or how a rule could be changed. In this paper, we introduce a new problem setting: manipulation of specific rules encoded by a deep generative model. To address the problem, we propose a formulation in which the desired rule is changed by manipulating a layer of a deep network as a linear associative memory. We derive an algorithm for modifying one entry of the associative memory, and we demonstrate that several interesting structural rules can be located and modified within the layers of state-of-the-art generative models. We present a user interface to enable users to interactively change the rules of a generative model to achieve desired effects, and we show several proof-of-concept applications. Finally, results on multiple datasets demonstrate the advantage of our method against standard fine-tuning methods and edit transfer algorithms.

4.LevelSet R-CNN: A Deep Variational Method for Instance Segmentation ⬇️

Obtaining precise instance segmentation masks is of high importance in many modern applications such as robotic manipulation and autonomous driving. Currently, many state of the art models are based on the Mask R-CNN framework which, while very powerful, outputs masks at low resolutions which could result in imprecise boundaries. On the other hand, classic variational methods for segmentation impose desirable global and local data and geometry constraints on the masks by optimizing an energy functional. While mathematically elegant, their direct dependence on good initialization, non-robust image cues and manual setting of hyperparameters renders them unsuitable for modern applications. We propose LevelSet R-CNN, which combines the best of both worlds by obtaining powerful feature representations that are combined in an end-to-end manner with a variational segmentation framework. We demonstrate the effectiveness of our approach on COCO and Cityscapes datasets.

5.Unsupervised Continuous Object Representation Networks for Novel View Synthesis ⬇️

Novel View Synthesis (NVS) is concerned with the generation of novel views of a scene from one or more source images. NVS requires explicit reasoning about 3D object structure and unseen parts of the scene. As a result, current approaches rely on supervised training with either 3D models or multiple target images. We present Unsupervised Continuous Object Representation Networks (UniCORN), which encode the geometry and appearance of a 3D scene using a neural 3D representation. Our model is trained with only two source images per object, requiring no ground truth 3D models or target view supervision. Despite being unsupervised, UniCORN achieves comparable results to the state-of-the-art on challenging tasks, including novel view synthesis and single-view 3D reconstruction.

6.Multi-label Zero-shot Classification by Learning to Transfer from External Knowledge ⬇️

Multi-label zero-shot classification aims to predict multiple unseen class labels for an input image. It is more challenging than its single-label counterpart. On one hand, the unconstrained number of labels assigned to each image makes the model more easily overfit to those seen classes. On the other hand, there is a large semantic gap between seen and unseen classes in the existing multi-label classification datasets. To address these difficult issues, this paper introduces a novel multi-label zero-shot classification framework by learning to transfer from external knowledge. We observe that ImageNet is commonly used to pretrain the feature extractor and has a large and fine-grained label space. This motivates us to exploit it as external knowledge to bridge the seen and unseen classes and promote generalization. Specifically, we construct a knowledge graph including not only classes from the target dataset but also those from ImageNet. Since ImageNet labels are not available in the target dataset, we propose a novel PosVAE module to infer their initial states in the extended knowledge graph. Then we design a relational graph convolutional network (RGCN) to propagate information among classes and achieve knowledge transfer. Experimental results on two benchmark datasets demonstrate the effectiveness of the proposed approach.

7.Heatmap-based Vanishing Point boosts Lane Detection ⬇️

Vision-based lane detection (LD) is a key part of autonomous driving technology, and it is also a challenging problem. As one of the important constraints of scene composition, vanishing point (VP) may provide a useful clue for lane detection. In this paper, we proposed a new multi-task fusion network architecture for high-precision lane detection. Firstly, the ERFNet was used as the backbone to extract the hierarchical features of the road image. Then, the lanes were detected using image segmentation. Finally, combining the output of lane detection and the hierarchical features extracted by the backbone, the lane VP was predicted using heatmap regression. The proposed fusion strategy was tested using the public CULane dataset. The experimental results suggest that the lane detection accuracy of our method outperforms those of state-of-the-art (SOTA) methods.

8.Dense Scene Multiple Object Tracking with Box-Plane Matching ⬇️

Multiple Object Tracking (MOT) is an important task in computer vision. MOT is still challenging due to the occlusion problem, especially in dense scenes. Following the tracking-by-detection framework, we propose the Box-Plane Matching (BPM) method to improve the MOT performacne in dense scenes. First, we design the Layer-wise Aggregation Discriminative Model (LADM) to filter the noisy detections. Then, to associate remaining detections correctly, we introduce the Global Attention Feature Model (GAFM) to extract appearance feature and use it to calculate the appearance similarity between history tracklets and current detections. Finally, we propose the Box-Plane Matching strategy to achieve data association according to the motion similarity and appearance similarity between tracklets and detections. With the effectiveness of the three modules, our team achieves the 1st place on the Track-1 leaderboard in the ACM MM Grand Challenge HiEve 2020.

9.Unsupervised Disentanglement GAN for Domain Adaptive Person Re-Identification ⬇️

While recent person re-identification (ReID) methods achieve high accuracy in a supervised setting, their generalization to an unlabelled domain is still an open problem. In this paper, we introduce a novel unsupervised disentanglement generative adversarial network (UD-GAN) to address the domain adaptation issue of supervised person ReID. Our framework jointly trains a ReID network for discriminative features extraction in a source labelled domain using identity annotation, and adapts the ReID model to an unlabelled target domain by learning disentangled latent representations on the domain. Identity-unrelated features in the target domain are distilled from the latent features. As a result, the ReID features better encompass the identity of a person in the unsupervised domain. We conducted experiments on the Market1501, DukeMTMC and MSMT17 datasets. Results show that the unsupervised domain adaptation problem in ReID is very challenging. Nevertheless, our method shows improvement in half of the domain transfers and achieve state-of-the-art performance for one of them.

10.Quantitative Distortion Analysis of Flattening Applied to the Scroll from En-Gedi ⬇️

Non-invasive volumetric imaging can now capture the internal structure and detailed evidence of ink-based writing from within the confines of damaged and deteriorated manuscripts that cannot be physically opened. As demonstrated recently on the En-Gedi scroll, our "virtual unwrapping" software pipeline enables the recovery of substantial ink-based text from damaged artifacts at a quality high enough for serious critical textual analysis. However, the quality of the resulting images is defined by the subjective evaluation of scholars, and a choice of specific algorithms and parameters must be available at each stage in the pipeline in order to maximize the output quality.

11.Event-based Stereo Visual Odometry ⬇️

Event-based cameras are bio-inspired vision sensors whose pixels work independently from each other and respond asynchronously to brightness changes, with microsecond resolution. Their advantages make it possible to tackle challenging scenarios in robotics, such as high-speed and high dynamic range scenes. We present a solution to the problem of visual odometry from the data acquired by a stereo event-based camera rig. Our system follows a parallel tracking-and-mapping approach, where novel solutions to each subproblem (3D reconstruction and camera pose estimation) are developed with two objectives in mind: being principled and efficient, for real-time operation with commodity hardware. To this end, we seek to maximize the spatio-temporal consistency of stereo event-based data while using a simple and efficient representation. Specifically, the mapping module builds a semi-dense 3D map of the scene by fusing depth estimates from multiple local viewpoints (obtained by spatio-temporal consistency) in a probabilistic fashion. The tracking module recovers the pose of the stereo rig by solving a registration problem that naturally arises due to the chosen map and event data representation. Experiments on publicly available datasets and on our own recordings demonstrate the versatility of the proposed method in natural scenes with general 6-DoF motion. The system successfully leverages the advantages of event-based cameras to perform visual odometry in challenging illumination conditions, such as low-light and high dynamic range, while running in real-time on a standard CPU. We release the software and dataset under an open source licence to foster research in the emerging topic of event-based SLAM.

12.Epipolar-Guided Deep Object Matching for Scene Change Detection ⬇️

This paper describes a viewpoint-robust object-based change detection network (OBJ-CDNet). Mobile cameras such as drive recorders capture images from different viewpoints each time due to differences in camera trajectory and shutter timing. However, previous methods for pixel-wise change detection are vulnerable to the viewpoint differences because they assume aligned image pairs as inputs. To cope with the difficulty, we introduce a deep graph matching network that establishes object correspondence between an image pair. The introduction enables us to detect object-wise scene changes without precise image alignment. For more accurate object matching, we propose an epipolar-guided deep graph matching network (EGMNet), which incorporates the epipolar constraint into the deep graph matching layer used in OBJCDNet. To evaluate our network's robustness against viewpoint differences, we created synthetic and real datasets for scene change detection from an image pair. The experimental results verified the effectiveness of our network.

13.A new Local Radon Descriptor for Content-Based Image Search ⬇️

Content-based image retrieval (CBIR) is an essential part of computer vision research, especially in medical expert systems. Having a discriminative image descriptor with the least number of parameters for tuning is desirable in CBIR systems. In this paper, we introduce a new simple descriptor based on the histogram of local Radon projections. We also propose a very fast convolution-based local Radon estimator to overcome the slow process of Radon projections. We performed our experiments using pathology images (KimiaPath24) and lung CT patches and test our proposed solution for medical image processing. We achieved superior results compared with other histogram-based descriptors such as LBP and HoG as well as some pre-trained CNNs.

14.SimPose: Effectively Learning DensePose and Surface Normals of People from Simulated Data ⬇️

With a proliferation of generic domain-adaptation approaches, we report a simple yet effective technique for learning difficult per-pixel 2.5D and 3D regression representations of articulated people. We obtained strong sim-to-real domain generalization for the 2.5D DensePose estimation task and the 3D human surface normal estimation task. On the multi-person DensePose MSCOCO benchmark, our approach outperforms the state-of-the-art methods which are trained on real images that are densely labelled. This is an important result since obtaining human manifold's intrinsic uv coordinates on real images is time consuming and prone to labeling noise. Additionally, we present our model's 3D surface normal predictions on the MSCOCO dataset that lacks any real 3D surface normal labels. The key to our approach is to mitigate the "Inter-domain Covariate Shift" with a carefully selected training batch from a mixture of domain samples, a deep batch-normalized residual network, and a modified multi-task learning objective. Our approach is complementary to existing domain-adaptation techniques and can be applied to other dense per-pixel pose estimation problems.

15.Cascaded Non-local Neural Network for Point Cloud Semantic Segmentation ⬇️

In this paper, we propose a cascaded non-local neural network for point cloud segmentation. The proposed network aims to build the long-range dependencies of point clouds for the accurate segmentation. Specifically, we develop a novel cascaded non-local module, which consists of the neighborhood-level, superpoint-level and global-level non-local blocks. First, in the neighborhood-level block, we extract the local features of the centroid points of point clouds by assigning different weights to the neighboring points. The extracted local features of the centroid points are then used to encode the superpoint-level block with the non-local operation. Finally, the global-level block aggregates the non-local features of the superpoints for semantic segmentation in an encoder-decoder framework. Benefiting from the cascaded structure, geometric structure information of different neighborhoods with the same label can be propagated. In addition, the cascaded structure can largely reduce the computational cost of the original non-local operation on point clouds. Experiments on different indoor and outdoor datasets show that our method achieves state-of-the-art performance and effectively reduces the time consumption and memory occupation.

16.Revisiting the Modifiable Areal Unit Problem in Deep Traffic Prediction with Visual Analytics ⬇️

Deep learning methods are being increasingly used for urban traffic prediction where spatiotemporal traffic data is aggregated into sequentially organized matrices that are then fed into convolution-based residual neural networks. However, the widely known modifiable areal unit problem within such aggregation processes can lead to perturbations in the network inputs. This issue can significantly destabilize the feature embeddings and the predictions, rendering deep networks much less useful for the experts. This paper approaches this challenge by leveraging unit visualization techniques that enable the investigation of many-to-many relationships between dynamically varied multi-scalar aggregations of urban traffic data and neural network predictions. Through regular exchanges with a domain expert, we design and develop a visual analytics solution that integrates 1) a Bivariate Map equipped with an advanced bivariate colormap to simultaneously depict input traffic and prediction errors across space, 2) a Morans I Scatterplot that provides local indicators of spatial association analysis, and 3) a Multi-scale Attribution View that arranges non-linear dot plots in a tree layout to promote model analysis and comparison across scales. We evaluate our approach through a series of case studies involving a real-world dataset of Shenzhen taxi trips, and through interviews with domain experts. We observe that geographical scale variations have important impact on prediction performances, and interactive visual exploration of dynamically varying inputs and outputs benefit experts in the development of deep traffic prediction models.

17.Learning from Few Samples: A Survey ⬇️

Deep neural networks have been able to outperform humans in some cases like image recognition and image classification. However, with the emergence of various novel categories, the ability to continuously widen the learning capability of such networks from limited samples, still remains a challenge. Techniques like Meta-Learning and/or few-shot learning showed promising results, where they can learn or generalize to a novel category/task based on prior knowledge. In this paper, we perform a study of the existing few-shot meta-learning techniques in the computer vision domain based on their method and evaluation metrics. We provide a taxonomy for the techniques and categorize them as data-augmentation, embedding, optimization and semantics based learning for few-shot, one-shot and zero-shot settings. We then describe the seminal work done in each category and discuss their approach towards solving the predicament of learning from few samples. Lastly we provide a comparison of these techniques on the commonly used benchmark datasets: Omniglot, and MiniImagenet, along with a discussion towards the future direction of improving the performance of these techniques towards the final goal of outperforming humans.

18.$grid2vec$: Learning Efficient Visual Representations via Flexible Grid-Graphs ⬇️

We propose $grid2vec$, a novel approach for image representation learning based on Graph Convolutional Network (GCN). Existing visual representation methods suffer from several issues, such as requiring high-computation, losing in-depth structures, and being restricted to specific objects. $grid2vec$ converts an image to a low-dimensional feature vector. A key component of $grid2vec$ is Flexible Grid-Graphs, a spatially-adaptive method based on the image key-points, as a flexible grid, to generate the graph representation. It represents each image with a graph of unique node locations and edge distances. Nodes, in Flexible Grid-Graphs, describe the most representative patches in the image. We develop a multi-channel Convolutional Neural Network architecture to learn local features of each patch. We implement a hybrid node-embedding method, i.e., having spectral and non-spectral components. It aggregates the products of neighbours' features and node's eigenvector centrality score. We compare the performance of $grid2vec$ with a set of state-of-the-art representation learning and visual recognition models. $grid2vec$ has only $512$ features in comparison to a range from VGG16 with $25,090$ to NASNet with $487,874$. We show the models' superior accuracy in both binary and multi-class image classification. Although we utilise imbalanced, low-size dataset, $grid2vec$ shows stable and superior results against the well-known base classifiers.

19.Label or Message: A Large-Scale Experimental Survey of Texts and Objects Co-Occurrence ⬇️

Our daily life is surrounded by textual information. Nowadays, the automatic collection of textual information becomes possible owing to the drastic improvement of scene text detectors and recognizer. The purpose of this paper is to conduct a large-scale survey of co-occurrence between visual objects (such as book and car) and scene texts with a large image dataset and a state-of-the-art scene text detector and recognizer. Especially, we focus on the function of "label" texts, which are attached to objects for detailing the objects. By analyzing co-occurrence between objects and scene texts, it is possible to observe the statistics about the label texts and understand how the scene texts will be useful for recognizing the objects and vice versa.

20.NormalGAN: Learning Detailed 3D Human from a Single RGB-D Image ⬇️

We propose NormalGAN, a fast adversarial learning-based method to reconstruct the complete and detailed 3D human from a single RGB-D image. Given a single front-view RGB-D image, NormalGAN performs two steps: front-view RGB-D rectification and back-view RGBD inference. The final model was then generated by simply combining the front-view and back-view RGB-D information. However, inferring backview RGB-D image with high-quality geometric details and plausible texture is not trivial. Our key observation is: Normal maps generally encode much more information of 3D surface details than RGB and depth images. Therefore, learning geometric details from normal maps is superior than other representations. In NormalGAN, an adversarial learning framework conditioned by normal maps is introduced, which is used to not only improve the front-view depth denoising performance, but also infer the back-view depth image with surprisingly geometric details. Moreover, for texture recovery, we remove shading information from the front-view RGB image based on the refined normal map, which further improves the quality of the back-view color inference. Results and experiments on both testing data set and real captured data demonstrate the superior performance of our approach. Given a consumer RGB-D sensor, NormalGAN can generate the complete and detailed 3D human reconstruction results in 20 fps, which further enables convenient interactive experiences in telepresence, AR/VR and gaming scenarios.

21.Infrastructure-based Multi-Camera Calibration using Radial Projections ⬇️

Multi-camera systems are an important sensor platform for intelligent systems such as self-driving cars. Pattern-based calibration techniques can be used to calibrate the intrinsics of the cameras individually. However, extrinsic calibration of systems with little to no visual overlap between the cameras is a challenge. Given the camera intrinsics, infrastucture-based calibration techniques are able to estimate the extrinsics using 3D maps pre-built via SLAM or Structure-from-Motion. In this paper, we propose to fully calibrate a multi-camera system from scratch using an infrastructure-based approach. Assuming that the distortion is mainly radial, we introduce a two-stage approach. We first estimate the camera-rig extrinsics up to a single unknown translation component per camera. Next, we solve for both the intrinsic parameters and the missing translation components. Extensive experiments on multiple indoor and outdoor scenes with multiple multi-camera systems show that our calibration method achieves high accuracy and robustness. In particular, our approach is more robust than the naive approach of first estimating intrinsic parameters and pose per camera before refining the extrinsic parameters of the system. The implementation is available at this https URL.

22.Searching Collaborative Agents for Multi-plane Localization in 3D Ultrasound ⬇️

3D ultrasound (US) is widely used due to its rich diagnostic information, portability and low cost. Automated standard plane (SP) localization in US volume not only improves efficiency and reduces user-dependence, but also boosts 3D US interpretation. In this study, we propose a novel Multi-Agent Reinforcement Learning (MARL) framework to localize multiple uterine SPs in 3D US simultaneously. Our contribution is two-fold. First, we equip the MARL with a one-shot neural architecture search (NAS) module to obtain the optimal agent for each plane. Specifically, Gradient-based search using Differentiable Architecture Sampler (GDAS) is employed to accelerate and stabilize the training process. Second, we propose a novel collaborative strategy to strengthen agents' communication. Our strategy uses recurrent neural network (RNN) to learn the spatial relationship among SPs effectively. Extensively validated on a large dataset, our approach achieves the accuracy of 7.05 degree/2.21mm, 8.62 degree/2.36mm and 5.93 degree/0.89mm for the mid-sagittal, transverse and coronal plane localization, respectively. The proposed MARL framework can significantly increase the plane localization accuracy and reduce the computational cost and model size.

23.Dynamic texture analysis for detecting fake faces in video sequences ⬇️

The creation of manipulated multimedia content involving human characters has reached in the last years unprecedented realism, calling for automated techniques to expose synthetically generated faces in images and videos. This work explores the analysis of spatio-temporal texture dynamics of the video signal, with the goal of characterizing and distinguishing real and fake sequences. We propose to build a binary decision on the joint analysis of multiple temporal segments and, in contrast to previous approaches, to exploit the textural dynamics of both the spatial and temporal dimensions. This is achieved through the use of Local Derivative Patterns on Three Orthogonal Planes (LDP-TOP), a compact feature representation known to be an important asset for the detection of face spoofing attacks. Experimental analyses on state-of-the-art datasets of manipulated videos show the discriminative power of such descriptors in separating real and fake sequences, and also identifying the creation method used. Linear Support Vector Machines (SVMs) are used which, despite the lower complexity, yield comparable performance to previously proposed deep models for fake content detection.

24.The Blessing and the Curse of the Noise behind Facial Landmark Annotations ⬇️

The evolving algorithms for 2D facial landmark detection empower people to recognize faces, analyze facial expressions, etc. However, existing methods still encounter problems of unstable facial landmarks when applied to videos. Because previous research shows that the instability of facial landmarks is caused by the inconsistency of labeling quality among the public datasets, we want to have a better understanding of the influence of annotation noise in them. In this paper, we make the following contributions: 1) we propose two metrics that quantitatively measure the stability of detected facial landmarks, 2) we model the annotation noise in an existing public dataset, 3) we investigate the influence of different types of noise in training face alignment neural networks, and propose corresponding solutions. Our results demonstrate improvements in both accuracy and stability of detected facial landmarks.

25.Weakly-Supervised Cell Tracking via Backward-and-Forward Propagation ⬇️

We propose a weakly-supervised cell tracking method that can train a convolutional neural network (CNN) by using only the annotation of "cell detection" (i.e., the coordinates of cell positions) without association information, in which cell positions can be easily obtained by nuclear staining. First, we train co-detection CNN that detects cells in successive frames by using weak-labels. Our key assumption is that co-detection CNN implicitly learns association in addition to detection. To obtain the association, we propose a backward-and-forward propagation method that analyzes the correspondence of cell positions in the outputs of co-detection CNN. Experiments demonstrated that the proposed method can associate cells by analyzing co-detection CNN. Even though the method uses only weak supervision, the performance of our method was almost the same as the state-of-the-art supervised method. Code is publicly available in this https URL

26.Instance Selection for GANs ⬇️

Recent advances in Generative Adversarial Networks (GANs) have led to their widespread adoption for the purposes of generating high quality synthetic imagery. While capable of generating photo-realistic images, these models often produce unrealistic samples which fall outside of the data manifold. Several recently proposed techniques attempt to avoid spurious samples, either by rejecting them after generation, or by truncating the model's latent space. While effective, these methods are inefficient, as large portions of model capacity are dedicated towards representing samples that will ultimately go unused. In this work we propose a novel approach to improve sample quality: altering the training dataset via instance selection before model training has taken place. To this end, we embed data points into a perceptual feature space and use a simple density model to remove low density regions from the data manifold. By refining the empirical data distribution before training we redirect model capacity towards high-density regions, which ultimately improves sample fidelity. We evaluate our method by training a Self-Attention GAN on ImageNet at 64x64 resolution, where we outperform the current state-of-the-art models on this task while using 1/2 of the parameters. We also highlight training time savings by training a BigGAN on ImageNet at 128x128 resolution, achieving a 66% increase in Inception Score and a 16% improvement in FID over the baseline model with less than 1/4 the training time.

27.Hierarchical Action Classification with Network Pruning ⬇️

Research on human action classification has made significant progresses in the past few years. Most deep learning methods focus on improving performance by adding more network components. We propose, however, to better utilize auxiliary mechanisms, including hierarchical classification, network pruning, and skeleton-based preprocessing, to boost the model robustness and performance. We test the effectiveness of our method on four commonly used testing datasets: NTU RGB+D 60, NTU RGB+D 120, Northwestern-UCLA Multiview Action 3D, and UTD Multimodal Human Action Dataset. Our experiments show that our method can achieve either comparable or better performance on all four datasets. In particular, our method sets up a new baseline for NTU 120, the largest dataset among the four. We also analyze our method with extensive comparisons and ablation studies.

28.Weakly Supervised Minirhizotron Image Segmentation with MIL-CAM ⬇️

We present a multiple instance learning class activation map (MIL-CAM) approach for pixel-level minirhizotron image segmentation given weak image-level labels. Minirhizotrons are used to image plant roots in situ. Minirhizotron imagery is often composed of soil containing a few long and thin root objects of small diameter. The roots prove to be challenging for existing semantic image segmentation methods to discriminate. In addition to learning from weak labels, our proposed MIL-CAM approach re-weights the root versus soil pixels during analysis for improved performance due to the heavy imbalance between soil and root pixels. The proposed approach outperforms other attention map and multiple instance learning methods for localization of root objects in minirhizotron imagery.

29.Action2Motion: Conditioned Generation of 3D Human Motions ⬇️

Action recognition is a relatively established task, where givenan input sequence of human motion, the goal is to predict its ac-tion category. This paper, on the other hand, considers a relativelynew problem, which could be thought of as an inverse of actionrecognition: given a prescribed action type, we aim to generateplausible human motion sequences in 3D. Importantly, the set ofgenerated motions are expected to maintain itsdiversityto be ableto explore the entire action-conditioned motion space; meanwhile,each sampled sequence faithfully resembles anaturalhuman bodyarticulation dynamics. Motivated by these objectives, we followthe physics law of human kinematics by adopting the Lie Algebratheory to represent thenaturalhuman motions; we also propose atemporal Variational Auto-Encoder (VAE) that encourages adiversesampling of the motion space. A new 3D human motion dataset, HumanAct12, is also constructed. Empirical experiments overthree distinct human motion datasets (including ours) demonstratethe effectiveness of our approach.

30.Detecting Suspicious Behavior: How to Deal with Visual Similarity through Neural Networks ⬇️

Suspicious behavior is likely to threaten security, assets, life, or freedom. This behavior has no particular pattern, which complicates the tasks to detect it and define it. Even for human observers, it is complex to spot suspicious behavior in surveillance videos. Some proposals to tackle abnormal and suspicious behavior-related problems are available in the literature. However, they usually suffer from high false-positive rates due to different classes with high visual similarity. The Pre-Crime Behavior method removes information related to a crime commission to focus on suspicious behavior before the crime happens. The resulting samples from different types of crime have a high-visual similarity with normal-behavior samples. To address this problem, we implemented 3D Convolutional Neural Networks and trained them under different approaches. Also, we tested different values in the number-of-filter parameter to optimize computational resources. Finally, the comparison between the performance using different training approaches shows the best option to improve the suspicious behavior detection on surveillance videos.

31.Key Frame Proposal Network for Efficient Pose Estimation in Videos ⬇️

Human pose estimation in video relies on local information by either estimating each frame independently or tracking poses across frames. In this paper, we propose a novel method combining local approaches with global context. We introduce a light weighted, unsupervised, key frame proposal network (K-FPN) to select informative frames and a learned dictionary to recover the entire pose sequence from these frames. The K-FPN speeds up the pose estimation and provides robustness to bad frames with occlusion, motion blur, and illumination changes, while the learned dictionary provides global dynamic context. Experiments on Penn Action and sub-JHMDB datasets show that the proposed method achieves state-of-the-art accuracy, with substantial speed-up.

32.Crowdsampling the Plenoptic Function ⬇️

Many popular tourist landmarks are captured in a multitude of online, public photos. These photos represent a sparse and unstructured sampling of the plenoptic function for a particular scene. In this paper,we present a new approach to novel view synthesis under time-varying illumination from such data. Our approach builds on the recent multi-plane image (MPI) format for representing local light fields under fixed viewing conditions. We introduce a new DeepMPI representation, motivated by observations on the sparsity structure of the plenoptic function, that allows for real-time synthesis of photorealistic views that are continuous in both space and across changes in lighting. Our method can synthesize the same compelling parallax and view-dependent effects as previous MPI methods, while simultaneously interpolating along changes in reflectance and illumination with time. We show how to learn a model of these effects in an unsupervised way from an unstructured collection of photos without temporal registration, demonstrating significant improvements over recent work in neural rendering. More information can be found this http URL.

33.Domain Adaptive Semantic Segmentation Using Weak Labels ⬇️

Learning semantic segmentation models requires a huge amount of pixel-wise labeling. However, labeled data may only be available abundantly in a domain different from the desired target domain, which only has minimal or no annotations. In this work, we propose a novel framework for domain adaptation in semantic segmentation with image-level weak labels in the target domain. The weak labels may be obtained based on a model prediction for unsupervised domain adaptation (UDA), or from a human annotator in a new weakly-supervised domain adaptation (WDA) paradigm for semantic segmentation. Using weak labels is both practical and useful, since (i) collecting image-level target annotations is comparably cheap in WDA and incurs no cost in UDA, and (ii) it opens the opportunity for category-wise domain alignment. Our framework uses weak labels to enable the interplay between feature alignment and pseudo-labeling, improving both in the process of domain adaptation. Specifically, we develop a weak-label classification module to enforce the network to attend to certain categories, and then use such training signals to guide the proposed category-wise alignment method. In experiments, we show considerable improvements with respect to the existing state-of-the-arts in UDA and present a new benchmark in the WDA setting. Project page is at this http URL.

34.An Improvement for Capsule Networks using Depthwise Separable Convolution ⬇️

Capsule Networks face a critical problem in computer vision in the sense that the image background can challenge its performance, although they learn very well on training data. In this work, we propose to improve Capsule Networks' architecture by replacing the Standard Convolution with a Depthwise Separable Convolution. This new design significantly reduces the model's total parameters while increases stability and offers competitive accuracy. In addition, the proposed model on $64\times64$ pixel images outperforms standard models on $32\times32$ and $64\times64$ pixel images. Moreover, we empirically evaluate these models with Deep Learning architectures using state-of-the-art Transfer Learning networks such as Inception V3 and MobileNet V1. The results show that Capsule Networks perform equivalently against Deep Learning models. To the best of our knowledge, we believe that this is the first work on the integration of Depthwise Separable Convolution into Capsule Networks.

35.Rethinking Recurrent Neural Networks and other Improvements for Image Classification ⬇️

For a long history of Machine Learning which dates back to several decades, Recurrent Neural Networks (RNNs) have been mainly used for sequential data and time series or generally 1D information. Even in some rare researches on 2D images, the networks merely learn and generate data sequentially rather than for recognition of images. In this research, we propose to integrate RNN as an additional layer in designing image recognition's models. Moreover, we develop End-to-End Ensemble Multi-models that are able to learn experts' predictions from several models. Besides, we extend training strategy and softmax pruning which overall leads our designs to perform comparably to top models on several datasets. The source code of the methods provided in this article is available in this https URL and this http URL.

36.Benchmarking and Comparing Multi-exposure Image Fusion Algorithms ⬇️

Multi-exposure image fusion (MEF) is an important area in computer vision and has attracted increasing interests in recent years. Apart from conventional algorithms, deep learning techniques have also been applied to multi-exposure image fusion. However, although much efforts have been made on developing MEF algorithms, the lack of benchmark makes it difficult to perform fair and comprehensive performance comparison among MEF algorithms, thus significantly hindering the development of this field. In this paper, we fill this gap by proposing a benchmark for multi-exposure image fusion (MEFB) which consists of a test set of 100 image pairs, a code library of 16 algorithms, 20 evaluation metrics, 1600 fused images and a software toolkit. To the best of our knowledge, this is the first benchmark in the field of multi-exposure image fusion. Extensive experiments have been conducted using MEFB for comprehensive performance evaluation and for identifying effective algorithms. We expect that MEFB will serve as an effective platform for researchers to compare performances and investigate MEF algorithms.

37.Fully Dynamic Inference with Deep Neural Networks ⬇️

Modern deep neural networks are powerful and widely applicable models that extract task-relevant information through multi-level abstraction. Their cross-domain success, however, is often achieved at the expense of computational cost, high memory bandwidth, and long inference latency, which prevents their deployment in resource-constrained and time-sensitive scenarios, such as edge-side inference and self-driving cars. While recently developed methods for creating efficient deep neural networks are making their real-world deployment more feasible by reducing model size, they do not fully exploit input properties on a per-instance basis to maximize computational efficiency and task accuracy. In particular, most existing methods typically use a one-size-fits-all approach that identically processes all inputs. Motivated by the fact that different images require different feature embeddings to be accurately classified, we propose a fully dynamic paradigm that imparts deep convolutional neural networks with hierarchical inference dynamics at the level of layers and individual convolutional filters/channels. Two compact networks, called Layer-Net (L-Net) and Channel-Net (C-Net), predict on a per-instance basis which layers or filters/channels are redundant and therefore should be skipped. L-Net and C-Net also learn how to scale retained computation outputs to maximize task accuracy. By integrating L-Net and C-Net into a joint design framework, called LC-Net, we consistently outperform state-of-the-art dynamic frameworks with respect to both efficiency and classification accuracy. On the CIFAR-10 dataset, LC-Net results in up to 11.9$\times$ fewer floating-point operations (FLOPs) and up to 3.3% higher accuracy compared to other dynamic inference methods. On the ImageNet dataset, LC-Net achieves up to 1.4$\times$ fewer FLOPs and up to 4.6% higher Top-1 accuracy than the other methods.

38.Single Image Cloud Detection via Multi-Image Fusion ⬇️

Artifacts in imagery captured by remote sensing, such as clouds, snow, and shadows, present challenges for various tasks, including semantic segmentation and object detection. A primary challenge in developing algorithms for identifying such artifacts is the cost of collecting annotated training data. In this work, we explore how recent advances in multi-image fusion can be leveraged to bootstrap single image cloud detection. We demonstrate that a network optimized to estimate image quality also implicitly learns to detect clouds. To support the training and evaluation of our approach, we collect a large dataset of Sentinel-2 images along with a per-pixel semantic labelling for land cover. Through various experiments, we demonstrate that our method reduces the need for annotated training data and improves cloud detection performance.

39.Learning To Pay Attention To Mistakes ⬇️

As evidenced in visual results in \cite{attenUnet}\cite{AttenUnet2018}\cite{InterSeg}\cite{UnetFrontNeuro}\cite{LearnActiveContour}, the periphery of foreground regions representing malignant tissues may be disproportionately assigned as belonging to the background class of healthy tissues in medical image segmentation. This leads to high false negative detection rates. In this paper, we propose a novel attention mechanism to directly address such high false negative rates, called Paying Attention to Mistakes. Our attention mechanism steers the models towards false positive identification, which counters the existing bias towards false negatives. The proposed mechanism has two complementary implementations: (a) "explicit" steering of the model to attend to a larger Effective Receptive Field on the foreground areas; (b) "implicit" steering towards false positives, by attending to a smaller Effective Receptive Field on the background areas. We validated our methods on three tasks: 1) binary dense prediction between vehicles and the background using CityScapes; 2) Enhanced Tumour Core segmentation with multi-modal MRI scans in BRATS2018; 3) segmenting stroke lesions using ultrasound images in ISLES2018. We compared our methods with state-of-the-art attention mechanisms in medical imaging, including self-attention, spatial-attention and spatial-channel mixed attention. Across all of the three different tasks, our models consistently outperform the baseline models in Intersection over Union (IoU) and/or Hausdorff Distance (HD). For instance, in the second task, the "explicit" implementation of our mechanism reduces the HD of the best baseline by more than $26%$, whilst improving the IoU by more than $3%$. We believe our proposed attention mechanism can benefit a wide range of medical and computer vision tasks, which suffer from over-detection of background.

40.Foveation for Segmentation of Ultra-High Resolution Images ⬇️

Segmentation of ultra-high resolution images is challenging because of their enormous size, consisting of millions or even billions of pixels. Typical solutions include dividing input images into patches of fixed size and/or down-sampling to meet memory constraints. Such operations incur information loss in the field-of-view (FoV) i.e., spatial coverage and the image resolution. The impact on segmentation performance is, however, as yet understudied. In this work, we start with a motivational experiment which demonstrates that the trade-off between FoV and resolution affects the segmentation performance on ultra-high resolution images---and furthermore, its influence also varies spatially according to the local patterns in different areas. We then introduce foveation module, a learnable "dataloader" which, for a given ultra-high resolution image, adaptively chooses the appropriate configuration (FoV/resolution trade-off) of the input patch to feed to the downstream segmentation model at each spatial location of the image. The foveation module is jointly trained with the segmentation network to maximise the task performance. We demonstrate on three publicly available high-resolution image datasets that the foveation module consistently improves segmentation performance over the cases trained with patches of fixed FoV/resolution trade-off. Our approach achieves the SoTA performance on the DeepGlobe aerial image dataset. On the Gleason2019 histopathology dataset, our model achieves better segmentation accuracy for the two most clinically important and ambiguous classes (Gleason Grade 3 and 4) than the top performers in the challenge by 13.1% and 7.5%, and improves on the average performance of 6 human experts by 6.5% and 7.5%. Our code and trained models are available at this https URL.

41.Deep Keypoint-Based Camera Pose Estimation with Geometric Constraints ⬇️

Estimating relative camera poses from consecutive frames is a fundamental problem in visual odometry (VO) and simultaneous localization and mapping (SLAM), where classic methods consisting of hand-crafted features and sampling-based outlier rejection have been a dominant choice for over a decade. Although multiple works propose to replace these modules with learning-based counterparts, most have not yet been as accurate, robust and generalizable as conventional methods. In this paper, we design an end-to-end trainable framework consisting of learnable modules for detection, feature extraction, matching and outlier rejection, while directly optimizing for the geometric pose objective. We show both quantitatively and qualitatively that pose estimation performance may be achieved on par with the classic pipeline. Moreover, we are able to show by end-to-end training, the key components of the pipeline could be significantly improved, which leads to better generalizability to unseen datasets compared to existing learning-based methods.

42.Outlier-Robust Estimation: Hardness, Minimally-Tuned Algorithms, and Applications ⬇️

Nonlinear estimation in robotics and vision is typically plagued with outliers due to wrong data association, or to incorrect detections from signal processing and machine learning methods. This paper introduces two unifying formulations for outlier-robust estimation, Generalized Maximum Consensus (G- MC) and Generalized Truncated Least Squares (G-TLS), and investigates fundamental limits, practical algorithms, and applications. Our first contribution is a proof that outlier-robust estimation is inapproximable: in the worst case, it is impossible to (even approximately) find the set of outliers, even with slower-than-polynomial-time algorithms (particularly, algorithms running in quasi-polynomial time). As a second contribution, we review and extend two general-purpose algorithms. The first, Adaptive Trimming (ADAPT), is combinatorial, and is suitable for G-MC; the second, Graduated Non-Convexity (GNC), is based on homotopy methods, and is suitable for G-TLS. We extend ADAPT and GNC to the case where the user does not have prior knowledge of the inlier-noise statistics (or the statistics may vary over time) and is unable to guess a reasonable threshold to separate inliers from outliers (as the one commonly used in RANSAC). We propose the first minimally-tuned algorithms for outlier rejection, that dynamically decide how to separate inliers from outliers. Our third contribution is an evaluation of the proposed algorithms on robot perception problems: mesh registration, image-based object detection (shape alignment), and pose graph optimization. ADAPT and GNC execute in real-time, are deterministic, outperform RANSAC, and are robust to 70-90% outliers. Their minimally-tuned versions also compare favorably with the state of the art, even though they do not rely on a noise bound for the inliers.

43.Cross-Modal Hierarchical Modelling for Fine-Grained Sketch Based Image Retrieval ⬇️

Sketch as an image search query is an ideal alternative to text in capturing the fine-grained visual details. Prior successes on fine-grained sketch-based image retrieval (FG-SBIR) have demonstrated the importance of tackling the unique traits of sketches as opposed to photos, e.g., temporal vs. static, strokes vs. pixels, and abstract vs. pixel-perfect. In this paper, we study a further trait of sketches that has been overlooked to date, that is, they are hierarchical in terms of the levels of detail -- a person typically sketches up to various extents of detail to depict an object. This hierarchical structure is often visually distinct. In this paper, we design a novel network that is capable of cultivating sketch-specific hierarchies and exploiting them to match sketch with photo at corresponding hierarchical levels. In particular, features from a sketch and a photo are enriched using cross-modal co-attention, coupled with hierarchical node fusion at every level to form a better embedding space to conduct retrieval. Experiments on common benchmarks show our method to outperform state-of-the-arts by a significant margin.

44.Unselfie: Translating Selfies to Neutral-pose Portraits in the Wild ⬇️

Due to the ubiquity of smartphones, it is popular to take photos of one's self, or "selfies." Such photos are convenient to take, because they do not require specialized equipment or a third-party photographer. However, in selfies, constraints such as human arm length often make the body pose look unnatural. To address this issue, we introduce $\textit{unselfie}$, a novel photographic transformation that automatically translates a selfie into a neutral-pose portrait. To achieve this, we first collect an unpaired dataset, and introduce a way to synthesize paired training data for self-supervised learning. Then, to $\textit{unselfie}$ a photo, we propose a new three-stage pipeline, where we first find a target neutral pose, inpaint the body texture, and finally refine and composite the person on the background. To obtain a suitable target neutral pose, we propose a novel nearest pose search module that makes the reposing task easier and enables the generation of multiple neutral-pose results among which users can choose the best one they like. Qualitative and quantitative evaluations show the superiority of our pipeline over alternatives.

45.Generative Classifiers as a Basis for Trustworthy Computer Vision ⬇️

With the maturing of deep learning systems, trustworthiness is becoming increasingly important for model assessment. We understand trustworthiness as the combination of explainability and robustness. Generative classifiers (GCs) are a promising class of models that are said to naturally accomplish these qualities. However, this has mostly been demonstrated on simple datasets such as MNIST, SVHN and CIFAR in the past. In this work, we firstly develop an architecture and training scheme that allows for GCs to be trained on the ImageNet classification task, a more relevant level of complexity for practical computer vision. The resulting models use an invertible neural network architecture and achieve a competetive ImageNet top-1 accuracy of up to 76.2%. Secondly, we show the large potential of GCs for trustworthiness. Explainability and some aspects of robustness are vastly improved compared to standard feed-forward models, even when the GCs are just applied naively. While not all trustworthiness problems are solved completely, we argue from our observations that GCs are an extremely promising basis for further algorithms and modifications, as have been developed in the past for feedforward models to increase their trustworthiness. We release our trained model for download in the hope that it serves as a starting point for various other generative classification tasks in much the same way as pretrained ResNet models do for discriminative classification.

46.Hybrid Deep Learning Gaussian Process for Diabetic Retinopathy Diagnosis and Uncertainty Quantification ⬇️

Diabetic Retinopathy (DR) is one of the microvascular complications of Diabetes Mellitus, which remains as one of the leading causes of blindness worldwide. Computational models based on Convolutional Neural Networks represent the state of the art for the automatic detection of DR using eye fundus images. Most of the current work address this problem as a binary classification task. However, including the grade estimation and quantification of predictions uncertainty can potentially increase the robustness of the model. In this paper, a hybrid Deep Learning-Gaussian process method for DR diagnosis and uncertainty quantification is presented. This method combines the representational power of deep learning, with the ability to generalize from small datasets of Gaussian process models. The results show that uncertainty quantification in the predictions improves the interpretability of the method as a diagnostic support tool. The source code to replicate the experiments is publicly available at this https URL.

47.Comparative study of deep learning methods for the automatic segmentation of lung, lesion and lesion type in CT scans of COVID-19 patients ⬇️

Recent research on COVID-19 suggests that CT imaging provides useful information to assess disease progression and assist diagnosis, in addition to help understanding the disease. There is an increasing number of studies that propose to use deep learning to provide fast and accurate quantification of COVID-19 using chest CT scans. The main tasks of interest are the automatic segmentation of lung and lung lesions in chest CT scans of confirmed or suspected COVID-19 patients. In this study, we compare twelve deep learning algorithms using a multi-center dataset, including both open-source and in-house developed algorithms. Results show that ensembling different methods can boost the overall test set performance for lung segmentation, binary lesion segmentation and multiclass lesion segmentation, resulting in mean Dice scores of 0.982, 0.724 and 0.469, respectively. The resulting binary lesions were segmented with a mean absolute volume error of 91.3 ml. In general, the task of distinguishing different lesion types was more difficult, with a mean absolute volume difference of 152 ml and mean Dice scores of 0.369 and 0.523 for consolidation and ground glass opacity, respectively. All methods perform binary lesion segmentation with an average volume error that is better than visual assessment by human raters, suggesting these methods are mature enough for a large-scale evaluation for use in clinical practice.

48.Searching for Pneumothorax in Half a Million Chest X-Ray Images ⬇️

Pneumothorax, a collapsed or dropped lung, is a fatal condition typically detected on a chest X-ray by an experienced radiologist. Due to shortage of such experts, automated detection systems based on deep neural networks have been developed. Nevertheless, applying such systems in practice remains a challenge. These systems, mostly compute a single probability as output, may not be enough for diagnosis. On the contrary, content-based medical image retrieval (CBIR) systems, such as image search, can assist clinicians for diagnostic purposes by enabling them to compare the case they are examining with previous (already diagnosed) cases. However, there is a lack of study on such attempt. In this study, we explored the use of image search to classify pneumothorax among chest X-ray images. All chest X-ray images were first tagged with deep pretrained features, which were obtained from existing deep learning models. Given a query chest X-ray image, the majority voting of the top K retrieved images was then used as a classifier, in which similar cases in the archive of past cases are provided besides the probability output. In our experiments, 551,383 chest X-ray images were obtained from three large recently released public datasets. Using 10-fold cross-validation, it is shown that image search on deep pretrained features achieved promising results compared to those obtained by traditional classifiers trained on the same features. To the best of knowledge, it is the first study to demonstrate that deep pretrained features can be used for CBIR of pneumothorax in half a million chest X-ray images.

49.Few shot domain adaptation for in situ macromolecule structural classification in cryo-electron tomograms ⬇️

Motivation: Cryo-Electron Tomography (cryo-ET) visualizes structure and spatial organization of macromolecules and their interactions with other subcellular components inside single cells in the close-to-native state at sub-molecular resolution. Such information is critical for the accurate understanding of cellular processes. However, subtomogram classification remains one of the major challenges for the systematic recognition and recovery of the macromolecule structures in cryo-ET because of imaging limits and data quantity. Recently, deep learning has significantly improved the throughput and accuracy of large-scale subtomogram classification. However often it is difficult to get enough high-quality annotated subtomogram data for supervised training due to the enormous expense of labeling. To tackle this problem, it is beneficial to utilize another already annotated dataset to assist the training process. However, due to the discrepancy of image intensity distribution between source domain and target domain, the model trained on subtomograms in source domainmay perform poorly in predicting subtomogram classes in the target domain.
Results: In this paper, we adapt a few shot domain adaptation method for deep learning based cross-domain subtomogram classification. The essential idea of our method consists of two parts: 1) take full advantage of the distribution of plentiful unlabeled target domain data, and 2) exploit the correlation between the whole source domain dataset and few labeled target domain data. Experiments conducted on simulated and real datasets show that our method achieves significant improvement on cross domain subtomogram classification compared with baseline methods.

50.Very Deep Super-Resolution of Remotely Sensed Images with Mean Square Error and Var-norm Estimators as Loss Functions ⬇️

In this work, very deep super-resolution (VDSR) method is presented for improving the spatial resolution of remotely sensed (RS) images for scale factor 4. The VDSR net is re-trained with Sentinel-2 images and with drone aero orthophoto images, thus becomes RS-VDSR and Aero-VDSR, respectively. A novel loss function, the Var-norm estimator, is proposed in the regression layer of the convolutional neural network during re-training and prediction. According to numerical and optical comparisons, the proposed nets RS-VDSR and Aero-VDSR can outperform VDSR during prediction with RS images. RS-VDSR outperforms VDSR up to 3.16 dB in terms of PSNR in Sentinel-2 images.

51.Black-box Adversarial Sample Generation Based on Differential Evolution ⬇️

Deep Neural Networks (DNNs) are being used in various daily tasks such as object detection, speech processing, and machine translation. However, it is known that DNNs suffer from robustness problems -- perturbed inputs called adversarial samples leading to misbehaviors of DNNs. In this paper, we propose a black-box technique called Black-box Momentum Iterative Fast Gradient Sign Method (BMI-FGSM) to test the robustness of DNN models. The technique does not require any knowledge of the structure or weights of the target DNN. Compared to existing white-box testing techniques that require accessing model internal information such as gradients, our technique approximates gradients through Differential Evolution and uses approximated gradients to construct adversarial samples. Experimental results show that our technique can achieve 100% success in generating adversarial samples to trigger misclassification, and over 95% success in generating samples to trigger misclassification to a specific target output label. It also demonstrates better perturbation distance and better transferability. Compared to the state-of-the-art black-box technique, our technique is more efficient. Furthermore, we conduct testing on the commercial Aliyun API and successfully trigger its misbehavior within a limited number of queries, demonstrating the feasibility of real-world black-box attack.

52.DeepPeep: Exploiting Design Ramifications to Decipher the Architecture of Compact DNNs ⬇️

The remarkable predictive performance of deep neural networks (DNNs) has led to their adoption in service domains of unprecedented scale and scope. However, the widespread adoption and growing commercialization of DNNs have underscored the importance of intellectual property (IP) protection. Devising techniques to ensure IP protection has become necessary due to the increasing trend of outsourcing the DNN computations on the untrusted accelerators in cloud-based services. The design methodologies and hyper-parameters of DNNs are crucial information, and leaking them may cause massive economic loss to the organization. Furthermore, the knowledge of DNN's architecture can increase the success probability of an adversarial attack where an adversary perturbs the inputs and alter the prediction.
In this work, we devise a two-stage attack methodology "DeepPeep" which exploits the distinctive characteristics of design methodologies to reverse-engineer the architecture of building blocks in compact DNNs. We show the efficacy of "DeepPeep" on P100 and P4000 GPUs. Additionally, we propose intelligent design maneuvering strategies for thwarting IP theft through the DeepPeep attack and proposed "Secure MobileNet-V1". Interestingly, compared to vanilla MobileNet-V1, secure MobileNet-V1 provides a significant reduction in inference latency ($\approx$60%) and improvement in predictive performance ($\approx$2%) with very-low memory and computation overheads.

53.Unsupervised Event Detection, Clustering, and Use Case Exposition in Micro-PMU Measurements ⬇️

Distribution-level phasor measurement units, a.k.a, micro-PMUs, report a large volume of high resolution phasor measurements which constitute a variety of event signatures of different phenomena that occur all across power distribution feeders. In order to implement an event-based analysis that has useful applications for the utility operator, one needs to extract these events from a large volume of micro-PMU data. However, due to the infrequent, unscheduled, and unknown nature of the events, it is often a challenge to even figure out what kind of events are out there to capture and scrutinize. In this paper, we seek to address this open problem by developing an unsupervised approach, which requires minimal prior human knowledge. First, we develop an unsupervised event detection method based on the concept of Generative Adversarial Networks (GAN). It works by training deep neural networks that learn the characteristics of the normal trends in micro-PMU measurements; and accordingly detect an event when there is any abnormality. We also propose a two-step unsupervised clustering method, based on a novel linear mixed integer programming formulation. It helps us categorize events based on their origin in the first step and their similarity in the second step. The active nature of the proposed clustering method makes it capable of identifying new clusters of events on an ongoing basis. The proposed unsupervised event detection and clustering methods are applied to real-world micro-PMU data. Results show that they can outperform the prevalent methods in the literature. These methods also facilitate our further analysis to identify important clusters of events that lead to unmasking several use cases that could be of value to the utility operator.

54.Learning RGB-D Feature Embeddings for Unseen Object Instance Segmentation ⬇️

Segmenting unseen objects in cluttered scenes is an important skill that robots need to acquire in order to perform tasks in new environments. In this work, we propose a new method for unseen object instance segmentation by learning RGB-D feature embeddings from synthetic data. A metric learning loss function is utilized to learn to produce pixel-wise feature embeddings such that pixels from the same object are close to each other and pixels from different objects are separated in the embedding space. With the learned feature embeddings, a mean shift clustering algorithm can be applied to discover and segment unseen objects. We further improve the segmentation accuracy with a new two-stage clustering algorithm. Our method demonstrates that non-photorealistic synthetic RGB and depth images can be used to learn feature embeddings that transfer well to real-world images for unseen object instance segmentation.

55.OrcVIO: Object residual constrained Visual-Inertial Odometry ⬇️

Introducing object-level semantic information into simultaneous localization and mapping (SLAM) system is critical. It not only improves the performance but also enables tasks specified in terms of meaningful objects. This work presents OrcVIO, for visual-inertial odometry tightly coupled with tracking and optimization over structured object models. OrcVIO differentiates through semantic feature and bounding-box reprojection errors to perform batch optimization over the pose and shape of objects. The estimated object states aid in real-time incremental optimization over the IMU-camera states. The ability of OrcVIO for accurate trajectory estimation and large-scale object-level mapping is evaluated using real data.