Skip to content

Latest commit

 

History

History
95 lines (95 loc) · 61.5 KB

20181030.md

File metadata and controls

95 lines (95 loc) · 61.5 KB

ArXiv cs.CV --Tue, 30 Oct 2018

1.Compressive Sampling Approach for Image Acquisition with Lensless Endoscope pdf

The lensless endoscope is a promising device designed to image tissues in vivo at the cellular scale. The traditional acquisition setup consists in raster scanning during which the focused light beam from the optical fiber illuminates sequentially each pixel of the field of view (FOV). The calibration step to focus the beam and the sampling scheme both take time. In this preliminary work, we propose a scanning method based on compressive sampling theory. The method does not rely on a focused beam but rather on the random illumination patterns generated by the single-mode fibers. Experiments are performed on synthetic data for different compression rates (from 10 to 100% of the FOV).

2.Few-shot 3D Multi-modal Medical Image Segmentation using Generative Adversarial Learning pdf

We address the problem of segmenting 3D multi-modal medical images in scenarios where very few labeled examples are available for training. Leveraging the recent success of adversarial learning for semi-supervised segmentation, we propose a novel method based on Generative Adversarial Networks (GANs) to train a segmentation model with both labeled and unlabeled images. The proposed method prevents over-fitting by learning to discriminate between true and fake patches obtained by a generator network. Our work extends current adversarial learning approaches, which focus on 2D single-modality images, to the more challenging context of 3D volumes of multiple modalities. The proposed method is evaluated on the problem of segmenting brain MRI from the iSEG-2017 and MRBrainS 2013 datasets. Significant performance improvement is reported, compared to state-of-art segmentation networks trained in a fully-supervised manner. In addition, our work presents a comprehensive analysis of different GAN architectures for semi-supervised segmentation, showing recent techniques like feature matching to yield a higher performance than conventional adversarial training approaches. Our code is publicly available at this https URL

3.A Coarse-to-fine Pyramidal Model for Person Re-identification via Multi-Loss Dynamic Training pdf

Most existing Re-IDentification (Re-ID) methods are highly dependent on precise bounding boxes that enable images to be aligned with each other. However, due to the inevitable challenging scenarios, current detection models often output inaccurate bounding boxes yet, which inevitably worsen the performance of these Re-ID algorithms. In this paper, to relax the requirement, we propose a novel coarse-to-fine pyramid model that not only incorporates local and global information, but also integrates the gradual cues between them. The pyramid model is able to match the cues at different scales and then search for the correct image of the same identity even when the image pair are not aligned. In addition, in order to learn discriminative identity representation, we explore a dynamic training scheme to seamlessly unify two losses and extract appropriate shared information between them. Experimental results clearly demonstrate that the proposed method achieves the state-of-the-art results on three datasets and it is worth noting that our approach exceeds the current best method by 9.5% on the most challenging dataset CUHK03.

4.Automatic CNN-based detection of cardiac MR motion artefacts using k-space data augmentation and curriculum learning pdf

Good quality of medical images is a prerequisite for the success of subsequent image analysis pipelines. Quality assessment of medical images is therefore an essential activity and for large population studies such as the UK Biobank (UKBB), manual identification of artefacts such as those caused by unanticipated motion is tedious and time-consuming. Therefore, there is an urgent need for automatic image quality assessment techniques. In this paper, we propose a method to automatically detect the presence of motion-related artefacts in cardiac magnetic resonance (CMR) cine images. We compare two deep learning architectures to classify poor quality CMR images: 1) 3D spatio-temporal Convolutional Neural Networks (3D-CNN), 2) Long-term Recurrent Convolutional Network (LRCN). Though in real clinical setup motion artefacts are common, high-quality imaging of UKBB, which comprises cross-sectional population data of volunteers who do not necessarily have health problems creates a highly imbalanced classification problem. Due to the high number of good quality images compared to the relatively low number of images with motion artefacts, we propose a novel data augmentation scheme based on synthetic artefact creation in k-space. We also investigate a learning approach using a predetermined curriculum based on synthetic artefact severity. We evaluate our pipeline on a subset of the UK Biobank data set consisting of 3510 CMR images. The LRCN architecture outperformed the 3D-CNN architecture and was able to detect 2D+time short axis images with motion artefacts in less than 1ms with high recall. We compare our approach to a range of state-of-the-art quality assessment methods. The novel data augmentation and curriculum learning approaches both improved classification performance achieving overall area under the ROC curve of 0.89.

5.Causal Inference in Nonverbal Dyadic Communication with Relevant Interval Selection and Granger Causality pdf

Human nonverbal emotional communication in dyadic dialogs is a process of mutual influence and adaptation. Identifying the direction of influence, or cause-effect relation between participants is a challenging task, due to two main obstacles. First, distinct emotions might not be clearly visible. Second, participants cause-effect relation is transient and variant over time. In this paper, we address these difficulties by using facial expressions that can be present even when strong distinct facial emotions are not visible. We also propose to apply a relevant interval selection approach prior to causal inference to identify those transient intervals where adaptation process occurs. To identify the direction of influence, we apply the concept of Granger causality to the time series of facial expressions on the set of relevant intervals. We tested our approach on synthetic data and then applied it to newly, experimentally obtained data. Here, we were able to show that a more sensitive facial expression detection algorithm and a relevant interval detection approach is most promising to reveal the cause-effect pattern for dyadic communication in various instructed interaction conditions.

6.Real-Time RGB-D Camera Pose Estimation in Novel Scenes using a Relocalisation Cascade pdf

Camera pose estimation is an important problem in computer vision. Common techniques either match the current image against keyframes with known poses, directly regress the pose, or establish correspondences between keypoints in the image and points in the scene to estimate the pose. In recent years, regression forests have become a popular alternative to establish such correspondences. They achieve accurate results, but have traditionally needed to be trained offline on the target scene, preventing relocalisation in new environments. Recently, we showed how to circumvent this limitation by adapting a pre-trained forest to a new scene on the fly. The adapted forests achieved relocalisation performance that was on par with that of offline forests, and our approach was able to estimate the camera pose in close to real time. In this paper, we present an extension of this work that achieves significantly better relocalisation performance whilst running fully in real time. To achieve this, we make several changes to the original approach: (i) instead of accepting the camera pose hypothesis without question, we make it possible to score the final few hypotheses using a geometric approach and select the most promising; (ii) we chain several instantiations of our relocaliser together in a cascade, allowing us to try faster but less accurate relocalisation first, only falling back to slower, more accurate relocalisation as necessary; and (iii) we tune the parameters of our cascade to achieve effective overall performance. These changes allow us to significantly improve upon the performance our original state-of-the-art method was able to achieve on the well-known 7-Scenes and Stanford 4 Scenes benchmarks. As additional contributions, we present a way of visualising the internal behaviour of our forests and show how to entirely circumvent the need to pre-train a forest on a generic scene.

7.Recurrent Transformer Networks for Semantic Correspondence pdf

We present recurrent transformer networks (RTNs) for obtaining dense correspondences between semantically similar images. Our networks accomplish this through an iterative process of estimating spatial transformations between the input images and using these transformations to generate aligned convolutional activations. By directly estimating the transformations between an image pair, rather than employing spatial transformer networks to independently normalize each individual image, we show that greater accuracy can be achieved. This process is conducted in a recursive manner to refine both the transformation estimates and the feature representations. In addition, a technique is presented for weakly-supervised training of RTNs that is based on a proposed classification loss. With RTNs, state-of-the-art performance is attained on several benchmarks for semantic correspondence.

8.Imagination Based Sample Construction for Zero-Shot Learning pdf

Zero-shot learning (ZSL) which aims to recognize unseen classes with no labeled training sample, efficiently tackles the problem of missing labeled data in image retrieval. Nowadays there are mainly two types of popular methods for ZSL to recognize images of unseen classes: probabilistic reasoning and feature projection. Different from these existing types of methods, we propose a new method: sample construction to deal with the problem of ZSL. Our proposed method, called Imagination Based Sample Construction (IBSC), innovatively constructs image samples of target classes in feature space by mimicking human associative cognition process. Based on an association between attribute and feature, target samples are constructed from different parts of various samples. Furthermore, dissimilarity representation is employed to select high-quality constructed samples which are used as labeled data to train a specific classifier for those unseen classes. In this way, zero-shot learning is turned into a supervised learning problem. As far as we know, it is the first work to construct samples for ZSL thus, our work is viewed as a baseline for future sample construction methods. Experiments on four benchmark datasets show the superiority of our proposed method.

9.Unsupervised Data Selection for Supervised Learning pdf

Recent research put a big effort in the development of deep learning architectures and optimizers obtaining impressive results in areas ranging from vision to language processing. However little attention has been addressed to the need of a methodological process of data collection. In this work we show that high quality data for supervised learning can be selected in an unsupervised manner and that by doing so one can obtain models capable to generalize better than in the case of random training set construction.

10.Burst ranking for blind multi-image deblurring pdf

We propose a new incremental aggregation algorithm for multi-image deblurring with automatic image selection. The primary motivation is that current bursts deblurring methods do not handle well situations in which misalignment or out-of-context frames are present in the burst. These real-life situations result in poor reconstructions or manual selection of the images that will be used to deblur. Automatically selecting best frames within the burst to improve the base reconstruction is challenging because the amount of possible images fusions is equal to the power set cardinal. Here, we approach the multi-image deblurring problem as a two steps process. First, we successfully learn a comparison function to rank a burst of images using a deep convolutional neural network. Then, an incremental Fourier burst accumulation with a reconstruction degradation mechanism is applied fusing only less blurred images that are sufficient to maximize the reconstruction quality. Experiments with the proposed algorithm have shown superior results when compared to other similar approaches, outperforming other methods described in the literature in previously described situations. We validate our findings on several synthetic and real datasets.

11.PartsNet: A Unified Deep Network for Automotive Engine Precision Parts Defect Detection pdf

Defect detection is a basic and essential task in automatic parts production, especially for automotive engine precision parts. In this paper, we propose a new idea to construct a deep convolutional network combining related knowledge of feature processing and the representation ability of deep learning. Our algorithm consists of a pixel-wise segmentation Deep Neural Network (DNN) and a feature refining network. The fully convolutional DNN is presented to learn basic features of parts defects. After that, several typical traditional methods which are used to refine the segmentation results are transformed into convolutional manners and integrated. We assemble these methods as a shallow network with fixed weights and empirical thresholds. These thresholds are then released to enhance its adaptation ability and realize end-to-end training. Testing results on different datasets show that the proposed method has good portability and outperforms the state-of-the-art algorithms.

12.Patch-based Sparse Representation For Bacterial Detection pdf

In this paper, we propose a supervised approach for bacterial detection in optical endomicroscopy images. This approach splits each image into a set of overlapping patches and assumes that observed intensities are linear combinations of the actual intensity values associated with background image structures, corrupted by additive Gaussian noise and potentially by a sparse outlier term modelling anomalies (which are considered to be candidate bacteria). The actual intensity term representing background structures is modelled as a linear combination of a few atoms drawn from a dictionary which is learned from bacteria-free data and then fixed while analyzing new images. The bacteria detection task is formulated as a minimization problem and an Alternating Direction Method of Multipliers (ADMM) is then used to estimate the unknown parameters. Simulations conducted using two ex vivo lung datasets show good detection and correlation performance between bacteria counts identified by a trained clinician and those of the proposed method.

13.An Augmented Linear Mixing Model to Address Spectral Variability for Hyperspectral Unmixing pdf

Hyperspectral imagery collected from airborne or satellite sources inevitably suffers from spectral variability, making it difficult for spectral unmixing to accurately estimate abundance maps. The classical unmixing model, the linear mixing model (LMM), generally fails to handle this sticky issue effectively. To this end, we propose a novel spectral mixture model, called the augmented linear mixing model (ALMM), to address spectral variability by applying a data-driven learning strategy in inverse problems of hyperspectral unmixing. The proposed approach models the main spectral variability (i.e., scaling factors) generated by variations in illumination or typography separately by means of the endmember dictionary. It then models other spectral variabilities caused by environmental conditions (e.g., local temperature and humidity, atmospheric effects) and instrumental configurations (e.g., sensor noise), as well as material nonlinear mixing effects, by introducing a spectral variability dictionary. To effectively run the data-driven learning strategy, we also propose a reasonable prior knowledge for the spectral variability dictionary, whose atoms are assumed to be low-coherent with spectral signatures of endmembers, which leads to a well-known low coherence dictionary learning problem. Thus, a dictionary learning technique is embedded in the framework of spectral unmixing so that the algorithm can learn the spectral variability dictionary and estimate the abundance maps simultaneously. Extensive experiments on synthetic and real datasets are performed to demonstrate the superiority and effectiveness of the proposed method in comparison with previous state-of-the-art methods.

14.GOT-10k: A Large High-Diversity Benchmark for Generic Object Tracking in the Wild pdf

In this work, we introduce a large high-diversity database for generic object tracking, called GOT-10k. GOT-10k is backboned by the semantic hierarchy of WordNet. It populates a majority of 563 object classes and 87 motion patterns in real-world, resulting in a scale of over 10 thousand video segments and 1.5 million bounding boxes. To our knowledge, GOT-10k is by far the richest motion trajectory dataset, and its coverage of object classes is more than a magnitude wider than similar scale counterparts. By publishing GOT-10k, we hope to encourage the development of generic purposed trackers that work for a wide range of moving objects and under diverse real-world scenarios. To promote generalization and avoid the evaluation results biased to seen classes, we follow the one-shot principle in dataset splitting where training and testing classes are zero-overlapped. We also carry out a series of analytical experiments to select a compact while highly representative testing subset -- it embodies 84 object classes and 32 motion patterns with only 180 video segments, allowing for efficient evaluation. Finally, we train and evaluate a number of representative trackers on GOT-10k and analyze their performance. The evaluation results suggest that tracking in real-world unconstrained videos is far from being solved, and only 40% of frames are successfully tracked using top ranking trackers. All the dataset, evaluation toolkit and baseline results will be made available.

15.Evolutionary Self-Expressive Models for Subspace Clustering pdf

The problem of organizing data that evolves over time into clusters is encountered in a number of practical settings. We introduce evolutionary subspace clustering, a method whose objective is to cluster a collection of evolving data points that lie on a union of low-dimensional evolving subspaces. To learn the parsimonious representation of the data points at each time step, we propose a non-convex optimization framework that exploits the self-expressiveness property of the evolving data while taking into account representation from the preceding time step. To find an approximate solution to the aforementioned non-convex optimization problem, we develop a scheme based on alternating minimization that both learns the parsimonious representation as well as adaptively tunes and infers a smoothing parameter reflective of the rate of data evolution. The latter addresses a fundamental challenge in evolutionary clustering -- determining if and to what extent one should consider previous clustering solutions when analyzing an evolving data collection. Our experiments on both synthetic and real-world datasets demonstrate that the proposed framework outperforms state-of-the-art static subspace clustering algorithms and existing evolutionary clustering schemes in terms of both accuracy and running time, in a range of scenarios.

16.Text-Adaptive Generative Adversarial Networks: Manipulating Images with Natural Language pdf

This paper addresses the problem of manipulating images using natural language description. Our task aims to semantically modify visual attributes of an object in an image according to the text describing the new visual appearance. Although existing methods synthesize images having new attributes, they do not fully preserve text-irrelevant contents of the original image. In this paper, we propose the text-adaptive generative adversarial network (TAGAN) to generate semantically manipulated images while preserving text-irrelevant contents. The key to our method is the text-adaptive discriminator that creates word-level local discriminators according to input text to classify fine-grained attributes independently. With this discriminator, the generator learns to generate images where only regions that correspond to the given text are modified. Experimental results show that our method outperforms existing methods on CUB and Oxford-102 datasets, and our results were mostly preferred on a user study. Extensive analysis shows that our method is able to effectively disentangle visual attributes and produce pleasing outputs.

17.Sequential anatomy localization in fetal echocardiography videos pdf

Fetal heart motion is an important diagnostic indicator for structural detection and functional assessment of congenital heart disease. We propose an approach towards integrating deep convolutional and recurrent architectures that utilize localized spatial and temporal features of different anatomical substructures within a global spatiotemporal context for interpretation of fetal echocardiography videos. We formulate our task as a cardiac structure localization problem with convolutional architectures for aggregating global spatial context and detecting anatomical structures on spatial region proposals. This information is aggregated temporally by recurrent architectures to quantify the progressive motion patterns. We experimentally show that the resulting architecture combines anatomical landmark detection at the frame-level over multiple video sequences-with temporal progress of the associated anatomical motions to encode local spatiotemporal fetal heart dynamics and is validated on a real-world clinical dataset.

18.Scale Estimation of Monocular SfM for a Multi-modal Stereo Camera pdf

This paper proposes a novel method of estimating the absolute scale of monocular SfM for a multi-modal stereo camera. In the fields of computer vision and robotics, scale estimation for monocular SfM has been widely investigated in order to simplify systems. This paper addresses the scale estimation problem for a stereo camera system in which two cameras capture different spectral images (e.g., RGB and FIR), whose feature points are difficult to directly match using descriptors. Furthermore, the number of matching points between FIR images can be comparatively small, owing to the low resolution and lack of thermal scene texture. To cope with these difficulties, the proposed method estimates the scale parameter using batch optimization, based on the epipolar constraint of a small number of feature correspondences between the invisible light images. The accuracy and numerical stability of the proposed method are verified by synthetic and real image experiments.

19.Enhanced CNN for image denoising pdf

Due to the flexible architectures of deep convolutional neural networks (CNNs), which are successfully used for image denoising. However, they suffer from the following drawbacks: (1) deep network architecture is very difficult to be train. (2) Very deep networks face the challenge of performance saturation. In this paper, we propose a novel method called enhanced convolutional neural denoising network (ECNDNet). Specifically, we use residual learning and batch normalization (BN) techniques to address the problem of training difficulties and accelerate the convergence of the network. In addition, dilated convolutions are used in our network to enlarge the context information and reduce the computational cost. Extensive experiments demonstrate that our ECNDNet outperforms the state-of-the-art methods such as IRCNN for image denoising.

20.Multi-Spectral Imaging via Computed Tomography (MUSIC) - Comparing Unsupervised Spectral Segmentations for Material Differentiation pdf

Multi-spectral computed tomography is an emerging technology for the non-destructive identification of object materials and the study of their physical properties. Applications of this technology can be found in various scientific and industrial contexts, such as luggage scanning at airports. Material distinction and its identification is challenging, even with spectral x-ray information, due to acquisition noise, tomographic reconstruction artefacts and scanning setup application constraints. We present MUSIC - and open access multi-spectral CT dataset in 2D and 3D - to promote further research in the area of material identification. We demonstrate the value of this dataset on the image analysis challenge of object segmentation purely based on the spectral response of its composing materials. In this context, we compare the segmentation accuracy of fast adaptive mean shift (FAMS) and unconstrained graph cuts on both datasets. We further discuss the impact of reconstruction artefacts and segmentation controls on the achievable results. Dataset, related software packages and further documentation are made available to the imaging community in an open-access manner to promote further data-driven research on the subject

21.Object Tracking in Hyperspectral Videos with Convolutional Features and Kernelized Correlation Filter pdf

Target tracking in hyperspectral videos is a new research topic. In this paper, a novel method based on convolutional network and Kernelized Correlation Filter (KCF) framework is presented for tracking objects of interest in hyperspectral videos. We extract a set of normalized three-dimensional cubes from the target region as fixed convolution filters which contain spectral information surrounding a target. The feature maps generated by convolutional operations are combined to form a three-dimensional representation of an object, thereby providing effective encoding of local spectral-spatial information. We show that a simple two-layer convolutional networks is sufficient to learn robust representations without the need of offline training with a large dataset. In the tracking step, KCF is adopted to distinguish targets from neighboring environment. Experimental results demonstrate that the proposed method performs well on sample hyperspectral videos, and outperforms several state-of-the-art methods tested on grayscale and color videos in the same scene.

22.Discrimination-aware Channel Pruning for Deep Neural Networks pdf

Channel pruning is one of the predominant approaches for deep model compression. Existing pruning methods either train from scratch with sparsity constraints on channels, or minimize the reconstruction error between the pre-trained feature maps and the compressed ones. Both strategies suffer from some limitations: the former kind is computationally expensive and difficult to converge, whilst the latter kind optimizes the reconstruction error but ignores the discriminative power of channels. To overcome these drawbacks, we investigate a simple-yet-effective method, called discrimination-aware channel pruning, to choose those channels that really contribute to discriminative power. To this end, we introduce additional losses into the network to increase the discriminative power of intermediate layers and then select the most discriminative channels for each layer by considering the additional loss and the reconstruction error. Last, we propose a greedy algorithm to conduct channel selection and parameter optimization in an iterative way. Extensive experiments demonstrate the effectiveness of our method. For example, on ILSVRC-12, our pruned ResNet-50 with 30% reduction of channels even outperforms the original model by 0.39% in top-1 accuracy.

23.Cascaded Pyramid Mining Network for Weakly Supervised Temporal Action Localization pdf

Weakly supervised temporal action localization, which aims at temporally locating action instances in untrimmed videos using only video-level class labels during training, is an important yet challenging problem in video analysis. Many current methods adopt the "localization by classification" framework: first do video classification, then locate temporal area contributing to the results most. However, this framework fails to locate the entire action instances and gives little consideration to the local context. In this paper, we present a novel architecture called Cascaded Pyramid Mining Network (CPMN) to address these issues using two effective modules. First, to discover the entire temporal interval of specific action, we design a two-stage cascaded module with proposed Online Adversarial Erasing (OAE) mechanism, where new and complementary regions are mined through feeding the erased feature maps of discovered regions back to the system. Second, to exploit hierarchical contextual information in videos and reduce missing detections, we design a pyramid module which produces a scale-invariant attention map through combining the feature maps from different levels. Final, we aggregate the results of two modules to perform action localization via locating high score areas in temporal Class Activation Sequence (CAS). Extensive experiments conducted on THUMOS14 and ActivityNet-1.3 datasets demonstrate the effectiveness of our method.

24.Deep Affinity Network for Multiple Object Tracking pdf

Multiple Object Tracking (MOT) plays an important role in solving many fundamental problems in video analysis in computer vision. Most MOT methods employ two steps: Object Detection and Data Association. The first step detects objects of interest in every frame of a video, and the second establishes correspondence between the detected objects in different frames to obtain their tracks. Object detection has made tremendous progress in the last few years due to deep learning. However, data association for tracking still relies on hand crafted constraints such as appearance, motion, spatial proximity, grouping etc. to compute affinities between the objects in different frames. In this paper, we harness the power of deep learning for data association in tracking by jointly modelling object appearances and their affinities between different frames in an end-to-end fashion. The proposed Deep Affinity Network (DAN) learns compact; yet comprehensive features of pre-detected objects at several levels of abstraction, and performs exhaustive pairing permutations of those features in any two frames to infer object affinities. DAN also accounts for multiple objects appearing and disappearing between video frames. We exploit the resulting efficient affinity computations to associate objects in the current frame deep into the previous frames for reliable on-line tracking. Our technique is evaluated on popular multiple object tracking challenges MOT15, MOT17 and UA-DETRAC. Comprehensive benchmarking under twelve evaluation metrics demonstrates that our approach is among the best performing techniques on the leader board for these challenges. The open source implementation of our work is available at this https URL.

25.Towards Human Pulse Rate Estimation from Face Video: Automatic Component Selection and Comparison of Blind Source Separation Methods pdf

Human heartbeat can be measured using several different ways appropriately based on the patient condition which includes contact base such as measured by using instruments and non-contact base such as computer vision assisted techniques. Non-contact based approached are getting popular due to those techniques are capable of mitigating some of the limitations of contact-based techniques especially in clinical section. However, existing vision guided approaches are not able to prove high accurate result due to various reason such as the property of camera, illumination changes, skin tones in face image, etc. We propose a technique that uses video as an input and returns pulse rate in output. Initially, key point detection is carried out on two facial subregions: forehead and nose-mouth. After removing unstable features, the temporal filtering is applied to isolate frequencies of interest. Then four component analysis methods are employed in order to distinguish the cardiovascular pulse signal from extraneous noise caused by respiration, vestibular activity and other changes in facial expression. Afterwards, proposed peak detection technique is applied for each component which extracted from one of the four different component selection algorithms. This will enable to locate the positions of peaks in each component. Proposed automatic components selection technique is employed in order to select an optimal component which will be used to calculate the heartbeat. Finally, we conclude with a comparison of four component analysis methods (PCA, FastICA, JADE, SHIBBS), processing face video datasets of fifteen volunteers with verification by an ECG/EKG Workstation as a ground truth.

26.Real-time Action Recognition with Dissimilarity-based Training of Specialized Module Networks pdf

This paper addresses the problem of real-time action recognition in trimmed videos, for which deep neural networks have defined the state-of-the-art performance in the recent literature. For attaining higher recognition accuracies with efficient computations, researchers have addressed the various aspects of limitations in the recognition pipeline. This includes network architecture, the number of input streams (where additional streams augment the color information), the cost function to be optimized, in addition to others. The literature has always aimed, though, at assigning the adopted network (or networks, in case of multiple streams) the task of recognizing the whole number of action classes in the dataset at hand. We propose to train multiple specialized module networks instead. Each module is trained to recognize a subset of the action classes. Towards this goal, we present a dissimilarity-based optimized procedure for distributing the action classes over the modules, which can be trained simultaneously offline. On two standard datasets--UCF-101 and HMDB-51--the proposed method demonstrates a comparable performance, that is superior in some aspects, to the state-of-the-art, and that satisfies the real-time constraint. We achieved 72.5% accuracy on the challenging HMDB-51 dataset. By assigning fewer and unalike classes to each module network, this research paves the way to benefit from light-weight architectures without compromising recognition accuracy.

27.On the role of ML estimation and Bregman divergences in sparse representation of covariance and precision matrices pdf

Sparse representation of structured signals requires modelling strategies that maintain specific signal properties, in addition to preserving original information content and achieving simpler signal representation. Therefore, the major design challenge is to introduce adequate problem formulations and offer solutions that will efficiently lead to desired representations. In this context, sparse representation of covariance and precision matrices, which appear as feature descriptors or mixture model parameters, respectively, will be in the main focus of this paper.

28.Flash Photography for Data-Driven Hidden Scene Recovery pdf

Vehicles, search and rescue personnel, and endoscopes use flash lights to locate, identify, and view objects in their surroundings. Here we show the first steps of how all these tasks can be done around corners with consumer cameras. Recent techniques for NLOS imaging using consumer cameras have not been able to both localize and identify the hidden object. We introduce a method that couples traditional geometric understanding and data-driven techniques. To avoid the limitation of large dataset gathering, we train the data-driven models on rendered samples to computationally recover the hidden scene on real data. The method has three independent operating modes: 1) a regression output to localize a hidden object in 2D, 2) an identification output to identify the object type or pose, and 3) a generative network to reconstruct the hidden scene from a new viewpoint. The method is able to localize 12cm wide hidden objects in 2D with 1.7cm accuracy. The method also identifies the hidden object class with 87.7% accuracy (compared to 33.3% random accuracy). This paper also provides an analysis on the distribution of information that encodes the occluded object in the accessible scene. We show that, unlike previously thought, the area that extends beyond the corner is essential for accurate object localization and identification.

29.3D Terrain Segmentation in the SWIR Spectrum pdf

We focus on the automatic 3D terrain segmentation problem using hyperspectral shortwave IR (HS-SWIR) imagery and 3D Digital Elevation Models (DEM). The datasets were independently collected, and metadata for the HS-SWIR dataset are unavailable. We explore an overall slope of the SWIR spectrum that correlates with the presence of moisture in soil to propose a band ratio test to be used as a proxy for soil moisture content to distinguish two broad classes of objects: live vegetation from impermeable manmade surface. We show that image based localization techniques combined with the Optimal Randomized RANdom Sample Consensus (RANSAC) algorithm achieve precise spatial matches between HS-SWIR data of a portion of downtown Los Angeles (LA (USA)) and the Visible image of a geo-registered 3D DEM, covering a wider-area of LA. Our spectral-elevation rule based approach yields an overall accuracy of 97.7%, segmenting the object classes into buildings, houses, trees, grass, and roads/parking lots.

30.Accelerated Inference in Markov Random Fields via Smooth Riemannian Optimization pdf

Markov Random Fields (MRFs) are a popular model for several pattern recognition and reconstruction problems in robotics and computer vision. Inference in MRFs is intractable in general and related work resorts to approximation algorithms. Among those techniques, semidefinite programming (SDP) relaxations have been shown to provide accurate estimates while scaling poorly with the problem size and being typically slow for practical applications. Our first contribution is to design a dual ascent method to solve standard SDP relaxations that takes advantage of the geometric structure of the problem to speed up computation. This technique, named Dual Ascent Riemannian Staircase (DARS), is able to solve large problem instances in seconds. Since our goal is to enable real-time inference on robotics platforms, we develop a second and faster approach. The backbone of this second approach is a novel SDP relaxation combined with a fast and scalable solver based on smooth Riemannian optimization. We show that this approach, named Fast Unconstrained SEmidefinite Solver (FUSES), can solve large problems in milliseconds. Contrarily to local MRF solvers, e.g., loopy belief propagation, our approaches do not require an initial guess. Moreover, we leverage recent results from optimization theory to provide per-instance sub-optimality guarantees. We demonstrate the proposed approaches in multi-class image segmentation problems. Extensive experimental evidence shows that (i) FUSES and DARS produce near-optimal solutions, attaining an objective within 0.2% of the optimum, (ii) FUSES and DARS are remarkably faster than general-purpose SDP solvers, and FUSES is more than two orders of magnitude faster than DARS while attaining similar solution quality, (iii) FUSES is faster than local search methods while being a global solver, and is a good candidate for real-time robotics applications.

31.3D MRI brain tumor segmentation using autoencoder regularization pdf

Automated segmentation of brain tumors from 3D magnetic resonance images (MRIs) is necessary for the diagnosis, monitoring, and treatment planning of the disease. Manual delineation practices require anatomical knowledge, are expensive, time consuming and can be inaccurate due to human error. Here, we describe a semantic segmentation network for tumor subregion segmentation from 3D MRIs based on encoder-decoder architecture. Due to a limited training dataset size, a variational auto-encoder branch is added to reconstruct the input image itself in order to regularize the shared decoder and impose additional constraints on its layers. The current approach won 1st place in the BraTS 2018 challenge.

32.A Cross-Modal Distillation Network for Person Re-identification in RGB-Depth pdf

Person re-identification involves the recognition over time of individuals captured using multiple distributed sensors. With the advent of powerful deep learning methods able to learn discriminant representations for visual recognition, cross-modal person re-identification based on different sensor modalities has become viable in many challenging applications in, e.g., autonomous driving, robotics and video surveillance. Although some methods have been proposed for re-identification between infrared and RGB images, few address depth and RGB images. In addition to the challenges for each modality associated with occlusion, clutter, misalignment, and variations in pose and illumination, there is a considerable shift across modalities since data from RGB and depth images are heterogeneous. In this paper, a new cross-modal distillation network is proposed for robust person re-identification between RGB and depth sensors. Using a two-step optimization process, the proposed method transfers supervision between modalities such that similar structural features are extracted from both RGB and depth modalities, yielding a discriminative mapping to a common feature space. Our experiments investigate the influence of the dimensionality of the embedding space, compares transfer learning from depth to RGB and vice versa, and compares against other state-of-the-art cross-modal re-identification methods. Results obtained with BIWI and RobotPKU datasets indicate that the proposed method can successfully transfer descriptive structural features from the depth modality to the RGB modality. It can significantly outperform state-of-the-art conventional methods and deep neural networks for cross-modal sensing between RGB and depth, with no impact on computational complexity.

33.Soft-Gated Warping-GAN for Pose-Guided Person Image Synthesis pdf

Despite remarkable advances in image synthesis research, existing works often fail in manipulating images under the context of large geometric transformations. Synthesizing person images conditioned on arbitrary poses is one of the most representative examples where the generation quality largely relies on the capability of identifying and modeling arbitrary transformations on different body parts. Current generative models are often built on local convolutions and overlook the key challenges (e.g. heavy occlusions, different views or dramatic appearance changes) when distinct geometric changes happen for each part, caused by arbitrary pose manipulations. This paper aims to resolve these challenges induced by geometric variability and spatial displacements via a new Soft-Gated Warping Generative Adversarial Network (Warping-GAN), which is composed of two stages: 1) it first synthesizes a target part segmentation map given a target pose, which depicts the region-level spatial layouts for guiding image synthesis with higher-level structure constraints; 2) the Warping-GAN equipped with a soft-gated warping-block learns feature-level mapping to render textures from the original image into the generated segmentation map. Warping-GAN is capable of controlling different transformation degrees given distinct target poses. Moreover, the proposed warping-block is light-weight and flexible enough to be injected into any networks. Human perceptual studies and quantitative evaluations demonstrate the superiority of our Warping-GAN that significantly outperforms all existing methods on two large datasets.

34.A Miniaturized Semantic Segmentation Method for Remote Sensing Image pdf

In order to save the memory, we propose a miniaturization method for neural network to reduce the parameter quantity existed in remote sensing (RS) image semantic segmentation model. The compact convolution optimization method is first used for standard U-Net to reduce the weights quantity. With the purpose of decreasing model performance loss caused by miniaturization and based on the characteristics of remote sensing image, fewer down-samplings and improved cascade atrous convolution are then used to improve the performance of the miniaturized U-Net. Compared with U-Net, our proposed Micro-Net not only achieves 29.26 times model compression, but also basically maintains the performance unchanged on the public dataset. We provide a Keras and Tensorflow hybrid programming implementation for our model: this https URL

35.$A^2$-Nets: Double Attention Networks pdf

Learning to capture long-range relations is fundamental to image/video recognition. Existing CNN models generally rely on increasing depth to model such relations which is highly inefficient. In this work, we propose the "double attention block", a novel component that aggregates and propagates informative global features from the entire spatio-temporal space of input images/videos, enabling subsequent convolution layers to access features from the entire space efficiently. The component is designed with a double attention mechanism in two steps, where the first step gathers features from the entire space into a compact set through second-order attention pooling and the second step adaptively selects and distributes features to each location via another attention. The proposed double attention block is easy to adopt and can be plugged into existing deep neural networks conveniently. We conduct extensive ablation studies and experiments on both image and video recognition tasks for evaluating its performance. On the image recognition task, a ResNet-50 equipped with our double attention blocks outperforms a much larger ResNet-152 architecture on ImageNet-1k dataset with over 40% less the number of parameters and less FLOPs. On the action recognition task, our proposed model achieves the state-of-the-art results on the Kinetics and UCF-101 datasets with significantly higher efficiency than recent works.

36.Deep Convolutional Neural Network Applied to Quality Assessment for Video Tracking pdf

Surveillance videos often suffer from blur and exposure distortions that occur during acquisition and storage, which can adversely influence following automatic image analysis results on video-analytic tasks. The purpose of this paper is to deploy an algorithm that can automatically assess the presence of exposure distortion in videos. In this work we to design and build one architecture for deep learning applied to recognition of distortions in a video. The goal is to know if the video present exposure distortions. Such an algorithm could be used to enhance or restoration image or to create an object tracker distortion-aware.

37.Unsupervised Multi-Target Domain Adaptation: An Information Theoretic Approach pdf

Unsupervised domain adaptation (uDA) models focus on pairwise adaptation settings where there is a single, labeled, source and a single target domain. However, in many real-world settings one seeks to adapt to multiple, but somewhat similar, target domains. Applying pairwise adaptation approaches to this setting may be suboptimal, as they fail to leverage shared information among multiple domains. In this work we propose an information theoretic approach for domain adaptation in the novel context of multiple target domains with unlabeled instances and one source domain with labeled instances. Our model aims to find a shared latent space common to all domains, while simultaneously accounting for the remaining private, domain-specific factors. Disentanglement of shared and private information is accomplished using a unified information-theoretic approach, which also serves to establish a stronger link between the latent representations and the observed data. The resulting model, accompanied by an efficient optimization algorithm, allows simultaneous adaptation from a single source to multiple target domains. We test our approach on three challenging publicly-available datasets, showing that it outperforms several popular domain adaptation methods.

38.An Acceleration Scheme to The Local Directional Pattern pdf

This study seeks to improve the running time of the Local Directional Pattern (LDP) during feature extraction using a newly proposed acceleration scheme to LDP. LDP is considered to be computationally expensive. To confirm this, the running time of the LDP to gray level co-occurrence matrix (GLCM) were it was established that the running time for LDP was two orders of magnitude higher than that of the GLCM. In this study, the performance of the newly proposed acceleration scheme was evaluated against LDP and Local Binary patter (LBP) using images from the publicly available extended Cohn-Kanade (CK+) dataset. Based on our findings, the proposed acceleration scheme significantly improves the running time of the LDP by almost 3 times during feature extraction

39.Noise Sensitivity of Local Descriptors vs ConvNets: An application to Facial Recognition pdf

The Local Binary Patterns (LBP) is a local descriptor proposed by Ojala et al to discriminate texture due to its discriminative power. However, the LBP is sensitive to noise and illumination changes. Consequently, several extensions to the LBP such as Median Binary Pattern (MBP) and methods such as Local Directional Pattern (LDP) have been proposed to address its drawbacks. Though studies by Zhou et al, suggest that the LDP exhibits poor performance in presence of random noise. Recently, convolution neural networks (ConvNets) were introduced which are increasingly becoming popular for feature extraction due to their discriminative power. This study aimed at evaluating the sensitivity of ResNet50, a ConvNet pre-trained model and local descriptors (LBP and LDP) to noise using the Extended Yale B face dataset with 5 different levels of noise added to the dataset. In our findings, it was observed that despite adding different levels of noise to the dataset, ResNet50 proved to be more robust than the local descriptors (LBP and LDP).

40.DeepSphere: Efficient spherical Convolutional Neural Network with HEALPix sampling for cosmological applications pdf

Convolutional Neural Networks (CNNs) are a cornerstone of the Deep Learning toolbox and have led to many breakthroughs in Artificial Intelligence. These networks have mostly been developed for regular Euclidean domains such as those supporting images, audio, or video. Because of their success, CNN-based methods are becoming increasingly popular in Cosmology. Cosmological data often comes as spherical maps, which make the use of the traditional CNNs more complicated. The commonly used pixelization scheme for spherical maps is the Hierarchical Equal Area isoLatitude Pixelisation (HEALPix). We present a spherical CNN for analysis of full and partial HEALPix maps, which we call DeepSphere. The spherical CNN is constructed by representing the sphere as a graph. Graphs are versatile data structures that can act as a discrete representation of a continuous manifold. Using the graph-based representation, we define many of the standard CNN operations, such as convolution and pooling. With filters restricted to being radial, our convolutions are equivariant to rotation on the sphere, and DeepSphere can be made invariant or equivariant to rotation. This way, DeepSphere is a special case of a graph CNN, tailored to the HEALPix sampling of the sphere. This approach is computationally more efficient than using spherical harmonics to perform convolutions. We demonstrate the method on a classification problem of weak lensing mass maps from two cosmological models and compare the performance of the CNN with that of two baseline classifiers. The results show that the performance of DeepSphere is always superior or equal to both of these baselines. For high noise levels and for data covering only a smaller fraction of the sphere, DeepSphere achieves typically 10% better classification accuracy than those baselines. Finally, we show how learned filters can be visualized to introspect the neural network.

41.ActionXPose: A Novel 2D Multi-view Pose-based Algorithm for Real-time Human Action Recognition pdf

We present ActionXPose, a novel 2D pose-based algorithm for posture-level Human Action Recognition (HAR). The proposed approach exploits 2D human poses provided by OpenPose detector from RGB videos. ActionXPose aims to process poses data to be provided to a Long Short-Term Memory Neural Network and to a 1D Convolutional Neural Network, which solve the classification problem. ActionXPose is one of the first algorithms that exploits 2D human poses for HAR. The algorithm has real-time performance and it is robust to camera movings, subject proximity changes, viewpoint changes, subject appearance changes and provide high generalization degree. In fact, extensive simulations show that ActionXPose can be successfully trained using different datasets at once. State-of-the-art performance on popular datasets for posture-related HAR problems (i3DPost, KTH) are provided and results are compared with those obtained by other methods, including the selected ActionXPose baseline. Moreover, we also proposed two novel datasets called MPOSE and ISLD recorded in our Intelligent Sensing Lab, to show ActionXPose generalization performance.

42.Embedding Geographic Locations for Modelling the Natural Environment using Flickr Tags and Structured Data pdf

Meta-data from photo-sharing websites such as Flickr can be used to obtain rich bag-of-words descriptions of geographic locations, which have proven valuable, among others, for modelling and predicting ecological features. One important insight from previous work is that the descriptions obtained from Flickr tend to be complementary to the structured information that is available from traditional scientific resources. To better integrate these two diverse sources of information, in this paper we consider a method for learning vector space embeddings of geographic locations. We show experimentally that this method improves on existing approaches, especially in cases where structured information is available.

43.Training Frankenstein's Creature to Stack: HyperTree Architecture Search pdf

We propose HyperTrees for the low cost automatic design of multiple-input neural network models. Much like how Dr. Frankenstein's creature was assembled from pieces before he came to life in the eponymous book, HyperTrees combine parts of other architectures to optimize for a new problem domain. We compare HyperTrees to rENAS, our extension of Efficient Neural Architecture Search (ENAS). To evaluate these architectures we introduce the CoSTAR Block Stacking Dataset for the benchmarking of neural network models. We utilize 5.1 cm colored blocks and introduce complexity with a stacking task, a bin providing wall obstacles, dramatic lighting variation, and object ambiguity in the depth space. We demonstrate HyperTrees and rENAS on this dataset by predicting full 3D poses semantically for the purpose of grasping and placing specific objects. Inputs to the network include RGB images, the current gripper pose, and the action to take. Predictions with our best model are accurate to within 30 degrees 90% of the time, 4 cm 72% of the time, and have an average test error of 3.3 cm and 12.6 degrees. The dataset contains more than 10,000 stacking attempts and 1 million frames of real data. Code and dataset instructions are available at github.com/jhu-lcsr/costar_plan .

44.Fabrik: An Online Collaborative Neural Network Editor pdf

We present Fabrik, an online neural network editor that provides tools to visualize, edit, and share neural networks from within a browser. Fabrik provides a simple and intuitive GUI to import neural networks written in popular deep learning frameworks such as Caffe, Keras, and TensorFlow, and allows users to interact with, build, and edit models via simple drag and drop. Fabrik is designed to be framework agnostic and support high interoperability, and can be used to export models back to any supported framework. Finally, it provides powerful collaborative features to enable users to iterate over model design remotely and at scale.

45.Self-Supervised GAN to Counter Forgetting pdf

GANs involve training two networks in an adversarial game, where each network's task depends on its adversary. Recently, several works have framed GAN training as an online or continual learning problem. We focus on the discriminator, which must perform classification under an (adversarially) shifting data distribution. When trained on sequential tasks, neural networks exhibit \emph{forgetting}. For GANs, discriminator forgetting leads to training instability. To counter forgetting, we encourage the discriminator to maintain useful representations by adding a self-supervision. Conditional GANs have a similar effect using labels. However, our self-supervised GAN does not require labels, and closes the performance gap between conditional and unconditional models. We show that, in doing so, the self-supervised discriminator learns better representations than regular GANs.

46.Monitoring the shape of weather, soundscapes, and dynamical systems: a new statistic for dimension-driven data analysis on large data sets pdf

Dimensionality-reduction methods are a fundamental tool in the analysis of large data sets. These algorithms work on the assumption that the "intrinsic dimension" of the data is generally much smaller than the ambient dimension in which it is collected. Alongside their usual purpose of mapping data into a smaller dimension with minimal information loss, dimensionality-reduction techniques implicitly or explicitly provide information about the dimension of the data set.
In this paper, we propose a new statistic that we call the $¦Ê$-profile for analysis of large data sets. The $¦Ê$-profile arises from a dimensionality-reduction optimization problem: namely that of finding a projection into $k$-dimensions that optimally preserves the secants between points in the data set. From this optimal projection we extract $¦Ê,$ the norm of the shortest projected secant from among the set of all normalized secants. This $¦Ê$ can be computed for any $k$; thus the tuple of $¦Ê$ values (indexed by dimension) becomes a $¦Ê$-profile. Algorithms such as the Secant-Avoidance Projection algorithm and the Hierarchical Secant-Avoidance Projection algorithm, provide a computationally feasible means of estimating the $¦Ê$-profile for large data sets, and thus a method of understanding and monitoring their behavior. As we demonstrate in this paper, the $¦Ê$-profile serves as a useful statistic in several representative settings: weather data, soundscape data, and dynamical systems data.

47.SAM-GCNN: A Gated Convolutional Neural Network with Segment-Level Attention Mechanism for Home Activity Monitoring pdf

In this paper, we propose a method for home activity monitoring. We demonstrate our model on dataset of Detection and Classification of Acoustic Scenes and Events (DCASE) 2018 Challenge Task 5. This task aims to classify multi-channel audios into one of the provided pre-defined classes. All of these classes are daily activities performed in a home environment. To tackle this task, we propose a gated convolutional neural network with segment-level attention mechanism (SAM-GCNN). The proposed framework is a convolutional model with two auxiliary modules: a gated convolutional neural network and a segment-level attention mechanism. Furthermore, we adopted model ensemble to enhance the capability of generalization of our model. We evaluated our work on the development dataset of DCASE 2018 Task 5 and achieved competitive performance, with a macro-averaged F-1 score increasing from 83.76% to 89.33%, compared with the convolutional baseline system.