Skip to content

Latest commit

 

History

History
51 lines (51 loc) · 29.7 KB

20181004.md

File metadata and controls

51 lines (51 loc) · 29.7 KB

ArXiv cs.CV --Thu, 4 Oct 2018

1.SuperDepth: Self-Supervised, Super-Resolved Monocular Depth Estimation pdf

Recent techniques in self-supervised monocular depth estimation are approaching the performance of supervised methods, but operate in low resolution only. We show that high resolution is key towards high-fidelity self-supervised monocular depth prediction. Inspired by recent deep learning methods for Single-Image Super-Resolution, we propose a sub-pixel convolutional layer extension for depth super-resolution that accurately synthesizes high-resolution disparities from their corresponding low-resolution convolutional features. In addition, we introduce a differentiable flip-augmentation layer that accurately fuses predictions from the image and its horizontally flipped version, reducing the effect of left and right shadow regions generated in the disparity map due to occlusions. Both contributions provide significant performance gains over the state-of-the-art in self-supervised depth and pose estimation on the public KITTI benchmark. A video of our approach can be found at this https URL

2.Task-Oriented Hand Motion Retargeting for Dexterous Manipulation Imitation pdf

Human hand actions are quite complex, especially when they involve object manipulation, mainly due to the high dimensionality of the hand and the vast action space that entails. Imitating those actions with dexterous hand models involves different important and challenging steps: acquiring human hand information, retargeting it to a hand model, and learning a policy from acquired data. In this work, we capture the hand information by using a state-of-the-art hand pose estimator. We tackle the retargeting problem from the hand pose to a 29 DoF hand model by combining inverse kinematics and PSO with a task objective optimisation. This objective encourages the virtual hand to accomplish the manipulation task, relieving the effect of the estimator's noise and the domain gap. Our approach leads to a better success rate in the grasping task compared to our inverse kinematics baseline, allowing us to record successful human demonstrations. Furthermore, we used these demonstrations to learn a policy network using generative adversarial imitation learning (GAIL) that is able to autonomously grasp an object in the virtual space.

3.An Effective Single-Image Super-Resolution Model Using Squeeze-and-Excitation Networks pdf

Recent works on single-image super-resolution are concentrated on improving performance through enhancing spatial encoding between convolutional layers. In this paper, we focus on modeling the correlations between channels of convolutional features. We present an effective deep residual network based on squeeze-and-excitation blocks (SEBlock) to reconstruct high-resolution (HR) image from low-resolution (LR) image. SEBlock is used to adaptively recalibrate channel-wise feature mappings. Further, short connections between each SEBlock are used to remedy information loss. Extensive experiments show that our model can achieve the state-of-the-art performance and get finer texture details.

4.Weighted Sigmoid Gate Unit for an Activation Function of Deep Neural Network pdf

An activation function has crucial role in a deep neural network.
A simple rectified linear unit (ReLU) are widely used for the activation function.
In this paper, a weighted sigmoid gate unit (WiG) is proposed as the activation function.
The proposed WiG consists of a multiplication of inputs and the weighted sigmoid gate.
It is shown that the WiG includes the ReLU and same activation functions as a special case.
Many activation functions have been proposed to overcome the performance of the ReLU.
In the literature, the performance is mainly evaluated with an object recognition task.
The proposed WiG is evaluated with the object recognition task and the image restoration task.
Then, the expeirmental comparisons demonstrate the proposed WiG overcomes the existing activation functions including the ReLU.

5.SAVOIAS: A Diverse, Multi-Category Visual Complexity Dataset pdf

Visual complexity identifies the level of intricacy and details in an image or the level of difficulty to describe the image. It is an important concept in a variety of areas such as cognitive psychology, computer vision and visualization, and advertisement. Yet, efforts to create large, downloadable image datasets with diverse content and unbiased groundtruthing are lacking. In this work, we introduce Savoias, a visual complexity dataset that compromises of more than 1,400 images from seven image categories relevant to the above research areas, namely Scenes, Advertisements, Visualization and infographics, Objects, Interior design, Art, and Suprematism. The images in each category portray diverse characteristics including various low-level and high-level features, objects, backgrounds, textures and patterns, text, and graphics. The ground truth for Savoias is obtained by crowdsourcing more than 37,000 pairwise comparisons of images using the forced-choice methodology and with more than 1,600 contributors. The resulting relative scores are then converted to absolute visual complexity scores using the Bradley-Terry method and matrix completion. When applying five state-of-the-art algorithms to analyze the visual complexity of the images in the Savoias dataset, we found that the scores obtained from these baseline tools only correlate well with crowdsourced labels for abstract patterns in the Suprematism category (Pearson correlation r=0.84). For the other categories, in particular, the objects and advertisement categories, low correlation coefficients were revealed (r=0.3 and 0.56, respectively). These findings suggest that (1) state-of-the-art approaches are mostly insufficient and (2) Savoias enables category-specific method development, which is likely to improve the impact of visual complexity analysis on specific application areas, including computer vision.

6.A deep learning pipeline for product recognition in store shelves pdf

Recognition of grocery products in store shelves poses peculiar challenges. Firstly, the task mandates the recognition of an extremely high number of different items, in the order of several thousands for medium-small shops, with many of them featuring small inter and intra class variability. Then, available product databases usually include just one or a few studio-quality images per product (refereed to here as reference images), whilst at test time recognition is performed on pictures displaying a portion of a shelf containing several products and taken in the store by cheap cameras (refereed to as query images). Moreover, as the items on sale in a store as well as their appearance change frequently overtime, a practical recognition system should handle seamlessly new products/packages. Inspired by recent advances in object detection and image retrieval, we propose to leverage on state of the art object detectors based on deep learning to obtain an initial product-agnostic item detection. Then, we pursue product recognition through similarity search between global descriptors computed on reference and cropped query images. To maximize performance, we learn an ad-hoc global descriptor by a CNN trained on reference images based on an image embedding loss. Our system is computationally expensive at training time, but can perform recognition rapidly and accurately at test time.

7.2018 Low-Power Image Recognition Challenge pdf

The Low-Power Image Recognition Challenge (LPIRC, this https URL) is an annual competition started in 2015. The competition identifies the best technologies that can classify and detect objects in images efficiently (short execution time and low energy consumption) and accurately (high precision). Over the four years, the winners' scores have improved more than 24 times. As computer vision is widely used in many battery-powered systems (such as drones and mobile phones), the need for low-power computer vision will become increasingly important. This paper summarizes LPIRC 2018 by describing the three different tracks and the winners' solutions.

8.A Robot Localization Framework Using CNNs for Object Detection and Pose Estimation pdf

External localization is an essential part for the indoor operation of small or cost-efficient robots, as they are used, for example, in swarm robotics. We introduce a two-stage localization and instance identification framework for arbitrary robots based on convolutional neural networks. Object detection is performed on an external camera image of the operation zone providing robot bounding boxes for an identification and orientation estimation convolutional neural network. Additionally, we propose a process to generate the necessary training data. The framework was evaluated with 3 different robot types and various identification patterns. We have analyzed the main framework hyperparameters providing recommendations for the framework operation settings. We achieved up to 98% mAP@IOU0.5 and only 1.6{\deg} orientation error, running with a frame rate of 50 Hz on a GPU.

9.PIRM Challenge on Perceptual Image Enhancement on Smartphones: Report pdf

This paper reviews the first challenge on efficient perceptual image enhancement with the focus on deploying deep learning models on smartphones. The challenge consisted of two tracks. In the first one, participants were solving the classical image super-resolution problem with a bicubic downscaling factor of 4. The second track was aimed at real-world photo enhancement, and the goal was to map low-quality photos from the iPhone 3GS device to the same photos captured with a DSLR camera. The target metric used in this challenge combined the runtime, PSNR scores and solutions' perceptual results measured in the user study. To ensure the efficiency of the submitted models, we additionally measured their runtime and memory requirements on Android smartphones. The proposed solutions significantly improved baseline results defining the state-of-the-art for image enhancement on smartphones.

10.Optimization Algorithm Inspired Deep Neural Network Structure Design pdf

Deep neural networks have been one of the dominant machine learning approaches in recent years. Several new network structures are proposed and have better performance than the traditional feedforward neural network structure. Representative ones include the skip connection structure in ResNet and the dense connection structure in DenseNet. However, it still lacks a unified guidance for the neural network structure design. In this paper, we propose the hypothesis that the neural network structure design can be inspired by optimization algorithms and a faster optimization algorithm may lead to a better neural network structure. Specifically, we prove that the propagation in the feedforward neural network with the same linear transformation in different layers is equivalent to minimizing some function using the gradient descent algorithm. Based on this observation, we replace the gradient descent algorithm with the heavy ball algorithm and Nesterov's accelerated gradient descent algorithm, which are faster and inspire us to design new and better network structures. ResNet and DenseNet can be considered as two special cases of our framework. Numerical experiments on CIFAR-10, CIFAR-100 and ImageNet verify the advantage of our optimization algorithm inspired structures over ResNet and DenseNet.

11.Extreme Augmentation : Can deep learning based medical image segmentation be trained using a single manually delineated scan? pdf

Yes, it can. Data augmentation is perhaps the oldest preprocessing step in computer vision literature. Almost every computer vision model trained on imaging data uses some form of augmentation. In this paper, we use the inter-vertebral disk segmentation task alongside a deep residual U-Net as the learning model, to explore the effectiveness of augmentation. In the extreme, we observed that a model trained on patches extracted from just one scan, with each patch augmented 50 times; achieved a Dice score of 0.73 in a validation set of 40 cases. Qualitative evaluation indicated a clinically usable segmentation algorithm, which appropriately segments regions of interest, alongside limited false positive specks. When the initial patches are extracted from nine scans the average Dice coefficient jumps to 0.86 and most of the false positives disappear. While this still falls short of state-of-the-art deep learning based segmentation of discs reported in literature, qualitative examination reveals that it does yield segmentation, which can be amended by expert clinicians with minimal effort to generate additional data for training improved deep models. Extreme augmentation of training data, should thus be construed as a strategy for training deep learning based algorithms, when very little manually annotated data is available to work with. Models trained with extreme augmentation can then be used to accelerate the generation of manually labelled data. Hence, we show that extreme augmentation can be a valuable tool in addressing scaling up small imaging data sets to address medical image segmentation tasks.

12.Cascaded Pyramid Network for 3D Human Pose Estimation Challenge pdf

Over the past decade, there has been a growing interest in human pose estimation. Although much work has been done on 2D pose estimation, 3D pose estimation has still been relatively studied less. In this paper, we propose a top-bottom based two-stage 3D estimation framework. GloabalNet and RefineNet in our 2D pose estimation process enable us to find occluded or invisible 2D joints while 2D-to-3D pose estimator composed of residual blocks is used to lift 2D joints to 3D joints effectively. The proposed method achieves promising results with mean per joint position error at 42.39 on the validation dataset on `3D Human Pose Estimation within the ECCV 2018 PoseTrack Challenge.'

13.Primitive Fitting Using Deep Boundary Aware Geometric Segmentation pdf

To identify and fit geometric primitives (e.g., planes, spheres, cylinders, cones) in a noisy point cloud is a challenging yet beneficial task for fields such as robotics and reverse engineering. As a multi-model multi-instance fitting problem, it has been tackled with different approaches including RANSAC, which however often fit inferior models in practice with noisy inputs of cluttered scenes. Inspired by the corresponding human recognition process, and benefiting from the recent advancements in image semantic segmentation using deep neural networks, we propose BAGSFit as a new framework addressing this problem. Firstly, through a fully convolutional neural network, the input point cloud is point-wisely segmented into multiple classes divided by jointly detected instance boundaries without any geometric fitting. Thus, segments can serve as primitive hypotheses with a probability estimation of associating primitive classes. Finally, all hypotheses are sent through a geometric verification to correct any misclassification by fitting primitives respectively. We performed training using simulated range images and tested it with both simulated and real-world point clouds. Quantitative and qualitative experiments demonstrated the superiority of BAGSFit.

14.Deep Fundamental Matrix Estimation without Correspondences pdf

Estimating fundamental matrices is a classic problem in computer vision. Traditional methods rely heavily on the correctness of estimated key-point correspondences, which can be noisy and unreliable. As a result, it is difficult for these methods to handle image pairs with large occlusion or significantly different camera poses. In this paper, we propose novel neural network architectures to estimate fundamental matrices in an end-to-end manner without relying on point correspondences. New modules and layers are introduced in order to preserve mathematical properties of the fundamental matrix as a homogeneous rank-2 matrix with seven degrees of freedom. We analyze performance of the proposed models using various metrics on the KITTI dataset, and show that they achieve competitive performance with traditional methods without the need for extracting correspondences.

15.Assessing Performance of Aerobic Routines using Background Subtraction and Intersected Image Region pdf

It is recommended for a novice to engage a trained and experience person, i.e., a coach before starting an unfamiliar aerobic or weight routine. The coach's task is to provide real-time feedbacks to ensure that the routine is performed in a correct manner. This greatly reduces the risk of injury and maximise physical gains. We present a simple image similarity measure based on intersected image region to assess a subject's performance of an aerobic routine. The method is implemented inside an Augmented Reality (AR) desktop app that employs a single RGB camera to capture still images of the subject as he or she progresses through the routine. The background-subtracted body pose image is compared against the exemplar body pose image (i.e., AR template) at specific intervals. Based on a limited dataset, our pose matching function is reported to have an accuracy of 93.67%.

16.Performance Evaluation of SIFT Descriptor against Common Image Deformations on Iban Plaited Mat Motifs pdf

Borneo indigenous communities are blessed with rich craft heritage. One such examples is the Iban's plaited mat craft. There have been many efforts by UNESCO and the Sarawak Government to preserve and promote the craft. One such method is by developing a mobile app capable of recognising the different mat motifs. As a first step towards this aim, we presents a novel image dataset consisting of seven mat motif classes. Each class possesses a unique variation of chevrons, diagonal shapes, symmetrical, repetitive, geometric and non geometric patterns. In this study, the performance of the Scale invariant feature transform (SIFT) descriptor is evaluated against five common image deformations, i.e., zoom and rotation, viewpoint, image blur, JPEG compression and illumination. Using our dataset, SIFT performed favourably with test sequences belonging to Illumination changes, Viewpoint changes, JPEG compression and Zoom and Rotation. However, it did not performed well with Image blur test sequences with an average of 1.61 percents retained pairwise matching after blurring with a Gaussian kernel of 8.0 radius.

17.Image as Data: Automated Visual Content Analysis for Political Science pdf

Image data provide unique information about political events, actors, and their interactions which are difficult to measure from or not available in text data. This article introduces a new class of automated methods based on computer vision and deep learning which can automatically analyze visual content data. Scholars have already recognized the importance of visual data and a variety of large visual datasets have become available. The lack of scalable analytic methods, however, has prevented from incorporating large scale image data in political analysis. This article aims to offer an in-depth overview of automated methods for visual content analysis and explains their usages and implementations. We further elaborate on how these methods and results can be validated and interpreted. We then discuss how these methods can contribute to the study of political communication, identity and politics, development, and conflict, by enabling a new set of research questions at scale.

18.Representation Flow for Action Recognition pdf

In this paper, we propose a convolutional layer inspired by optical flow algorithms to learn motion representations. Our representation flow layer is a fully-differentiable layer designed to optimally capture the flow' of any representation channel within a convolutional neural network. Its parameters for iterative flow optimization are learned in an end-to-end fashion together with the other model parameters, maximizing the action recognition performance. Furthermore, we newly introduce the concept of learning flow of flow' representations by stacking multiple representation flow layers. We conducted extensive experimental evaluations, confirming its advantages over previous recognition models using traditional optical flows in both computational speed and performance.

19.Human-Centered Autonomous Vehicle Systems: Principles of Effective Shared Autonomy pdf

Building effective, enjoyable, and safe autonomous vehicles is a lot harder than has historically been considered. The reason is that, simply put, an autonomous vehicle must interact with human beings. This interaction is not a robotics problem nor a machine learning problem nor a psychology problem nor an economics problem nor a policy problem. It is all of these problems put into one. It challenges our assumptions about the limitations of human beings at their worst and the capabilities of artificial intelligence systems at their best. This work proposes a set of principles for designing and building autonomous vehicles in a human-centered way that does not run away from the complexity of human nature but instead embraces it. We describe our development of the Human-Centered Autonomous Vehicle (HCAV) as an illustrative case study of implementing these principles in practice.

20.Theory of Generative Deep Learning : Probe Landscape of Empirical Error via Norm Based Capacity Control pdf

Despite its remarkable empirical success as a highly competitive branch of artificial intelligence, deep learning is often blamed for its widely known low interpretation and lack of firm and rigorous mathematical foundation. However, most theoretical endeavor is devoted in discriminative deep learning case, whose complementary part is generative deep learning. To the best of our knowledge, we firstly highlight landscape of empirical error in generative case to complete the full picture through exquisite design of image super resolution under norm based capacity control. Our theoretical advance in interpretation of the training dynamic is achieved from both mathematical and biological sides.

21.Towards WARSHIP: Combining Components of Brain-Inspired Computing of RSH for Image Super Resolution pdf

Evolution of deep learning shows that some algorithmic tricks are more durable , while others are not. To the best of our knowledge, we firstly summarize 5 more durable and complete deep learning components for vision, that is, WARSHIP. Moreover, we give a biological overview of WARSHIP, emphasizing brain-inspired computing of WARSHIP. As a step towards WARSHIP, our case study of image super resolution combines 3 components of RSH to deploy a CNN model of WARSHIP-XZNet, which performs a happy medium between speed and performance.

22.DeepCMB: Lensing Reconstruction of the Cosmic Microwave Background with Deep Neural Networks pdf

Next-generation cosmic microwave background (CMB) experiments will have lower noise and therefore increased sensitivity, enabling improved constraints on fundamental physics parameters such as the sum of neutrino masses and the tensor-to-scalar ratio r. Achieving competitive constraints on these parameters requires high signal-to-noise extraction of the projected gravitational potential from the CMB maps. Standard methods for reconstructing the lensing potential employ the quadratic estimator (QE). However, the QE performs suboptimally at the low noise levels expected in upcoming experiments. Other methods, like maximum likelihood estimators (MLE), are under active development. In this work, we demonstrate reconstruction of the CMB lensing potential with deep convolutional neural networks (CNN) - ie, a ResUNet. The network is trained and tested on simulated data, and otherwise has no physical parametrization related to the physical processes of the CMB and gravitational lensing. We show that, over a wide range of angular scales, ResUNets recover the input gravitational potential with a higher signal-to-noise ratio than the QE method, reaching levels comparable to analytic approximations of MLE methods. We demonstrate that the network outputs quantifiably different lensing maps when given input CMB maps generated with different cosmologies. We also show we can use the reconstructed lensing map for cosmological parameter estimation. This application of CNN provides a few innovations at the intersection of cosmology and machine learning. First, while training and regressing on images, we predict a continuous-variable field rather than discrete classes. Second, we are able to establish uncertainty measures for the network output that are analogous to standard methods. We expect this approach to excel in capturing hard-to-model non-Gaussian astrophysical foreground and noise contributions.

23.Scientific image rendering for space scenes with the SurRender software pdf

Spacecraft autonomy can be enhanced by vision-based navigation (VBN) techniques. Applications range from manoeuvers around Solar System objects and landing on planetary surfaces, to in-orbit servicing or space debris removal. The development and validation of VBN algorithms relies on the availability of physically accurate relevant images. Yet archival data from past missions can rarely serve this purpose and acquiring new data is often costly. The SurRender software is an image simulator that addresses the challenges of realistic image rendering, with high representativeness for space scenes. Images are rendered by raytracing, which implements the physical principles of geometrical light propagation, in physical units. A macroscopic instrument model and scene objects reflectance functions are used. SurRender is specially optimized for space scenes, with huge distances between objects and scenes up to Solar System size. Raytracing conveniently tackles some important effects for VBN algorithms: image quality, eclipses, secondary illumination, subpixel limb imaging, etc. A simulation is easily setup (in MATLAB, Python, and more) by specifying the position of the bodies (camera, Sun, planets, satellites) over time, 3D shapes and material surface properties. SurRender comes with its own modelling tool enabling to go beyond existing models for shapes, materials and sensors (projection, temporal sampling, electronics, etc.). It is natively designed to simulate different kinds of sensors (visible, LIDAR, etc.). Tools are available for manipulating huge datasets to store albedo maps and digital elevation models, or for procedural (fractal) texturing that generates high-quality images for a large range of observing distances (from millions of km to touchdown). We illustrate SurRender performances with a selection of case studies, placing particular emphasis on a 900-km Moon flyby simulation.

24.FFJORD: Free-form Continuous Dynamics for Scalable Reversible Generative Models pdf

A promising class of generative models maps points from a simple distribution to a complex distribution through an invertible neural network. Likelihood-based training of these models requires restricting their architectures to allow cheap computation of Jacobian determinants. Alternatively, the Jacobian trace can be used if the transformation is specified by an ordinary differential equation. In this paper, we use Hutchinson's trace estimator to give a scalable unbiased estimate of the log-density. The result is a continuous-time invertible generative model with unbiased density estimation and one-pass sampling, while allowing unrestricted neural network architectures. We demonstrate our approach on high-dimensional density estimation, image generation, and variational inference, achieving the state-of-the-art among exact likelihood methods with efficient sampling.

25.Sinkhorn AutoEncoders pdf

Optimal Transport offers an alternative to maximum likelihood for learning generative autoencoding models. We show how this principle dictates the minimization of the Wasserstein distance between the encoder aggregated posterior and the prior, plus a reconstruction error. We prove that in the non-parametric limit the autoencoder generates the data distribution if and only if the two distributions match exactly, and that the optimum can be obtained by deterministic autoencoders. We then introduce the Sinkhorn AutoEncoder (SAE), which casts the problem into Optimal Transport on the latent space. The resulting Wasserstein distance is minimized by backpropagating through the Sinkhorn algorithm. SAE models the aggregated posterior as an implicit distribution and therefore does not need a reparameterization trick for gradients estimation. Moreover, it requires virtually no adaptation to different prior distributions. We demonstrate its flexibility by considering models with hyperspherical and Dirichlet priors, as well as a simple case of probabilistic programming. SAE matches or outperforms other autoencoding models in visual quality and FID scores.