ArXiv cs.CV -- Thu, 15 Oct 2020

1.Self-Supervised Ranking for Representation Learning ⬇️

We present a new framework for self-supervised representation learning by positing it as a ranking problem in an image retrieval context on a large number of random views from random sets of images. Our work is based on two intuitive observations: first, a good representation of images must yield a high-quality image ranking in a retrieval task; second, we would expect random views of an image to be ranked closer to a reference view of that image than random views of other images. Hence, we model representation learning as a learning-to-rank problem in an image retrieval context, and train it by maximizing average precision (AP) for ranking. Specifically, given a mini-batch of images, we generate a large number of positive/negative samples and calculate a ranking loss term by separately treating each image view as a retrieval query. The new framework, dubbed S2R2, enables computing a global objective compared to the local objective in the popular contrastive learning framework calculated on pairs of views. A global objective leads S2R2 to faster convergence in terms of the number of epochs. In principle, by using a ranking criterion, we eliminate reliance on object-centered curated datasets (e.g., ImageNet). When trained on STL10 and MS-COCO, S2R2 outperforms SimCLR and performs on par with the state-of-the-art clustering-based contrastive learning model, SwAV, while being much simpler both conceptually and implementation-wise. Furthermore, when trained on a small subset of MS-COCO with fewer similar scenes, S2R2 significantly outperforms both SwAV and SimCLR. This indicates that S2R2 is potentially more effective on diverse scenes and decreases the need for a large training dataset for self-supervised learning.
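
To make the ranking view concrete, the sketch below (not the paper's exact objective, which optimizes a differentiable AP surrogate during training; all names are illustrative) treats every view in a batch as a retrieval query and scores the induced ranking by average precision, counting other views of the same source image as positives.

```python
import numpy as np

def batch_mean_ap(embeddings, image_ids):
    """Mean average precision with each view used as a retrieval query.

    embeddings: (N, D) L2-normalized view embeddings
    image_ids:  (N,) source-image id of each view
    """
    sims = embeddings @ embeddings.T            # cosine similarities
    np.fill_diagonal(sims, -np.inf)             # a query never retrieves itself
    aps = []
    for q in range(len(image_ids)):
        order = np.argsort(-sims[q])            # ranking induced by similarity
        rel = (image_ids[order] == image_ids[q]).astype(float)
        if rel.sum() == 0:
            continue                            # no other view of this image
        prec_at_k = np.cumsum(rel) / (np.arange(len(rel)) + 1)
        aps.append((prec_at_k * rel).sum() / rel.sum())
    return float(np.mean(aps))
```

The hard ranking above is only suitable for evaluation; training requires a smooth surrogate so that gradients can flow through the sort.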

2.Vision-Aided Radio: User Identity Match in Radio and Video Domains Using Machine Learning ⬇️

5G is designed to be an essential enabler and a leading infrastructure provider in the communication technology industry by supporting the growing demand for data traffic and a variety of services with distinct requirements. The use of deep learning and computer vision tools has the means to increase the environmental awareness of the network with information from visual data. Information extracted via computer vision tools, such as user position, movement direction, and speed, can be made promptly available to the network. However, the network must have a mechanism to match the identity of a user in both the visual and radio systems. This mechanism is absent in the present literature. Therefore, we propose a framework to match the information from both the visual and radio domains. This is an essential step towards practical applications of computer vision tools in communications. We detail the proposed framework's training and deployment phases for a presented setup. We carried out practical experiments using data collected in different types of environments. The work compares the use of Deep Neural Network and Random Forest classifiers and shows that the former performed better across all experiments, achieving classification accuracy greater than 99%.

3.Back to the Future: Cycle Encoding Prediction for Self-supervised Contrastive Video Representation Learning ⬇️

In this paper we show that learning video feature spaces in which temporal cycles are maximally predictable benefits action classification. In particular, we propose a novel learning approach termed Cycle Encoding Prediction~(CEP) that is able to effectively represent high-level spatio-temporal structure of unlabelled video content. CEP builds a latent space wherein the concept of closed forward-backward as well as backward-forward temporal loops is approximately preserved. As a self-supervision signal, CEP leverages the bi-directional temporal coherence of the video stream and applies loss functions that encourage both temporal cycle closure as well as contrastive feature separation. Architecturally, the underpinning network structure utilises a single feature encoder for all video snippets, adding two predictive modules that learn temporal forward and backward transitions. We apply our framework for pretext training of networks for action recognition tasks. We report significantly improved results for the standard datasets UCF101 and HMDB51. Detailed ablation studies support the effectiveness of the proposed components. We publish source code for the CEP components in full with this paper.

4.Manifold-Net: Using Manifold Learning for Point Cloud Classification ⬇️

In this paper, we propose a point cloud classification method based on graph neural networks and manifold learning. Different from conventional point cloud analysis methods, this paper uses manifold learning algorithms to embed point cloud features so as to better account for the geometric continuity of the surface. Then, the nature of the point cloud can be acquired in a low-dimensional space, and after being concatenated with features in the original three-dimensional (3D) space, both the capability of feature representation and the classification network performance can be improved. We propose two manifold learning modules, where one is based on the locally linear embedding algorithm, and the other is a non-linear projection method based on a neural network architecture. Both of them obtain better performance than the state-of-the-art baseline. Afterwards, the graph model is constructed by using the k-nearest-neighbors algorithm, where the edge features are effectively aggregated for the implementation of point cloud classification. Experiments show that the proposed point cloud classification methods obtain a mean class accuracy (mA) of 90.2% and an overall accuracy (oA) of 93.2%, which are competitive with existing state-of-the-art methods.

5.Learning Propagation Rules for Attribution Map Generation ⬇️

Prior gradient-based attribution-map methods rely on handcrafted propagation rules for the non-linear/activation layers during the backward pass, so as to produce gradients of the input and then the attribution map. Despite the promising results achieved, such methods are sensitive to non-informative high-frequency components and lack adaptability across models and samples. In this paper, we propose a dedicated method to generate attribution maps that learns the propagation rules automatically, overcoming the flaws of handcrafted ones. Specifically, we introduce a learnable plugin module, which enables adaptive propagation rules for each pixel, into the non-linear layers during the backward pass for mask generation. The masked input image is then fed into the model again to obtain a new output that can be used as guidance when combined with the original one. The introduced learnable module can be trained under any auto-grad framework with higher-order differential support. As demonstrated on five datasets and six network architectures, the proposed method yields state-of-the-art results and gives cleaner and more visually plausible attribution maps.

6.A Vector-based Representation to Enhance Head Pose Estimation ⬇️

This paper proposes to use the three vectors of a rotation matrix as the representation for head pose estimation and develops a new neural network based on the characteristics of this representation. We address two potential issues in current head pose estimation work: 1. Public datasets for head pose estimation use either Euler angles or quaternions to annotate data samples. However, both of these annotations suffer from discontinuity and thus can cause performance problems in neural network training. 2. Most research reports the Mean Absolute Error (MAE) of Euler angles as the measure of performance. We show that MAE may not reflect actual behavior, especially for profile views. To solve these two problems, we propose a new annotation method, which uses three vectors to describe head poses, and a new measurement, Mean Absolute Error of Vectors (MAEV), to assess performance. We also train a new neural network to predict the three vectors under orthogonality constraints. Our proposed method achieves state-of-the-art results on both the AFLW2000 and BIWI datasets. Experiments show that our vector-based annotation method can effectively reduce prediction errors for large pose angles.

7.WeightAlign: Normalizing Activations by Weight Alignment ⬇️

Batch normalization (BN) allows training very deep networks by normalizing activations with mini-batch sample statistics, which renders BN unstable for small batch sizes. Current small-batch solutions such as Instance Norm, Layer Norm, and Group Norm use channel statistics, which can be computed even for a single sample. Such methods are less stable than BN, as they critically depend on the statistics of a single input sample. To address this problem, we propose a normalization of activations without sample statistics. We present WeightAlign: a method that normalizes the weights by the mean and scaled standard deviation computed within a filter, which normalizes activations without computing any sample statistics. Our proposed method is independent of batch size and stable over a wide range of batch sizes. Because weight statistics are orthogonal to sample statistics, we can directly combine WeightAlign with any method for activation normalization. We experimentally demonstrate these benefits for classification on CIFAR-10, CIFAR-100, and ImageNet, for semantic segmentation on PASCAL VOC 2012, and for domain adaptation on Office-31.
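
A minimal sketch of the idea (the paper's exact scaling may differ; the fan-in factor here is an assumption chosen so that pre-activations come out roughly unit-variance under unit-variance inputs): standardize each convolutional filter by its own mean and a scaled standard deviation, with no sample statistics involved.

```python
import torch

def weight_align(weight, eps=1e-5):
    """Normalize conv weights per filter; weight: (C_out, C_in, kH, kW)."""
    mean = weight.mean(dim=(1, 2, 3), keepdim=True)
    var = weight.var(dim=(1, 2, 3), keepdim=True)
    fan_in = weight[0].numel()                  # C_in * kH * kW
    # Scale the std by fan-in so each filter's output has ~unit variance
    # under the (assumed) unit-variance-input condition.
    return (weight - mean) / torch.sqrt(var * fan_in + eps)
```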

8.Multi-class segmentation under severe class imbalance: A case study in roof damage assessment ⬇️

The task of roof damage classification and segmentation from overhead imagery presents unique challenges. In this work we choose to address the challenge posed due to strong class imbalance. We propose four distinct techniques that aim at mitigating this problem. Through a new scheme that feeds the data to the network by oversampling the minority classes, and three other network architectural improvements, we manage to boost the macro-averaged F1-score of a model by 39.9 percentage points, thus achieving improved segmentation performance, especially on the minority classes.
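
One common way to realize such an oversampling feeding scheme in PyTorch (a generic sketch, not the authors' exact pipeline; names are illustrative) is to draw each sample with probability inversely proportional to its class frequency:

```python
import torch
from torch.utils.data import DataLoader, WeightedRandomSampler

def balanced_loader(dataset, labels, batch_size=16):
    """labels: integer class label per sample, aligned with dataset order."""
    labels = torch.as_tensor(labels)
    class_counts = torch.bincount(labels).float()
    sample_weights = 1.0 / class_counts[labels]   # rare classes drawn more often
    sampler = WeightedRandomSampler(sample_weights,
                                    num_samples=len(labels),
                                    replacement=True)
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler)
```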

9.Online Anomaly Detection in Surveillance Videos with Asymptotic Bounds on False Alarm Rate ⬇️

Anomaly detection in surveillance videos is attracting an increasing amount of attention. Despite the competitive performance of recent methods, they lack theoretical performance analysis, particularly due to the complex deep neural network architectures used in decision making. Additionally, online decision making is an important but mostly neglected factor in this domain. Many of the existing methods that claim to be online depend on batch or offline processing in practice. Motivated by these research gaps, we propose an online anomaly detection method for surveillance videos with asymptotic bounds on the false alarm rate, which in turn provides a clear procedure for selecting a proper decision threshold that satisfies the desired false alarm rate. Our proposed algorithm consists of a multi-objective deep learning module along with a statistical anomaly detection module, and its effectiveness is demonstrated on several publicly available data sets where we outperform the state-of-the-art algorithms. All codes are available at this https URL.

10.A New Distributional Ranking Loss With Uncertainty: Illustrated in Relative Depth Estimation ⬇️

We propose a new approach to the problem of relative depth estimation from a single image. Instead of directly regressing over depth scores, we formulate the problem as estimation of a probability distribution over depth and aim to learn the parameters of the distributions which maximize the likelihood of the given data. To train our model, we propose a new ranking loss, the Distributional Loss, which tries to increase the probability of the farther pixel's depth being greater than the closer pixel's depth. Our proposed approach allows our model to output confidence in its estimation, in the form of the standard deviation of the distribution. We achieve state-of-the-art results against a number of baselines while providing confidence in our estimations. Our analysis shows that the estimated confidence is actually a good indicator of accuracy. We investigate the use of the confidence information in the downstream task of metric depth estimation to increase its performance.
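
Under a per-pixel Gaussian assumption (my reading of the abstract, not a confirmed detail of the paper), the objective has a closed form: if the farther and closer pixels have predicted depths N(mu_f, s_f^2) and N(mu_c, s_c^2), then P(d_f > d_c) = Phi((mu_f - mu_c) / sqrt(s_f^2 + s_c^2)), and the loss is its negative log:

```python
import torch

def distributional_ranking_loss(mu_far, sigma_far, mu_close, sigma_close):
    """Negative log-probability that the farther pixel is predicted deeper."""
    z = (mu_far - mu_close) / torch.sqrt(sigma_far**2 + sigma_close**2 + 1e-8)
    normal = torch.distributions.Normal(0.0, 1.0)
    p_correct = normal.cdf(z).clamp_min(1e-8)   # P(depth_far > depth_close)
    return -torch.log(p_correct).mean()
```

The predicted sigma doubles as the model's confidence output, matching the abstract's description.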

11.Better Patch Stitching for Parametric Surface Reconstruction ⬇️

Recently, parametric mappings have emerged as highly effective surface representations, yielding low reconstruction error. In particular, the latest works represent the target shape as an atlas of multiple mappings, which can closely encode object parts. Atlas representations, however, suffer from one major drawback: The individual mappings are not guaranteed to be consistent, which results in holes in the reconstructed shape or in jagged surface areas.
We introduce an approach that explicitly encourages global consistency of the local mappings. To this end, we introduce two novel loss terms. The first term exploits the surface normals and requires that they remain locally consistent when estimated within and across the individual mappings. The second term further encourages a better spatial configuration of the mappings by minimizing a novel stitching error. We show on standard benchmarks that the normal-consistency requirement outperforms the baselines quantitatively, while enforcing better stitching leads to much better visual quality of the reconstructed objects compared to the state of the art.

12.FC-DCNN: A densely connected neural network for stereo estimation ⬇️

We propose a novel lightweight network for stereo estimation. Our network consists of a fully-convolutional densely connected neural network (FC-DCNN) that computes matching costs between rectified image pairs. Our FC-DCNN method learns expressive features and performs some simple but effective post-processing steps. The densely connected layer structure connects the output of each layer to the input of each subsequent layer. This network structure and the fact that we do not use any fully-connected layers or 3D convolutions leads to a very lightweight network. The output of this network is used in order to calculate matching costs and create a cost-volume. Instead of using time and memory-inefficient cost-aggregation methods such as semi-global matching or conditional random fields in order to improve the result, we rely on filtering techniques, namely median filter and guided filter. By computing a left-right consistency check we get rid of inconsistent values. Afterwards we use a watershed foreground-background segmentation on the disparity image with removed inconsistencies. This mask is then used to refine the final prediction. We show that our method works well for both challenging indoor and outdoor scenes by evaluating it on the Middlebury, KITTI and ETH3D benchmarks respectively. Our full framework is available at this https URL
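
The left-right consistency check mentioned above can be sketched in a few lines (the threshold tau and variable names are mine): warp each pixel of the left disparity map into the right image and keep it only if the right disparity agrees.

```python
import numpy as np

def lr_consistency_mask(disp_left, disp_right, tau=1.0):
    """True where the left disparity is confirmed by the right disparity."""
    h, w = disp_left.shape
    xs = np.tile(np.arange(w), (h, 1))
    # pixel (y, x) in the left image maps to (y, x - d) in the right image
    x_right = np.clip(np.rint(xs - disp_left).astype(int), 0, w - 1)
    d_right = disp_right[np.arange(h)[:, None], x_right]
    return np.abs(disp_left - d_right) <= tau
```

Pixels failing the check are the "inconsistent values" that the pipeline removes before the watershed segmentation step.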

13.Relative Depth Estimation as a Ranking Problem ⬇️

We present a formulation of relative depth estimation from a single image as a ranking problem. By reformulating the problem this way, we were able to draw on the literature on ranking and apply that existing knowledge to achieve better results. To this end, we introduced weighted ListMLE, a listwise ranking loss borrowed from the ranking literature, to the relative depth estimation problem. We also introduce a new metric that considers pixel depth ranking accuracy, on which our method is stronger.
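
For reference, plain ListMLE maximizes the likelihood of the ground-truth ordering under a Plackett-Luce model; a weighted variant scales each position's term. A compact sketch (where the per-position weights are placed is an assumption; the paper may weight differently):

```python
import torch

def weighted_listmle(scores, weights):
    """scores:  (n,) model scores, pre-sorted by ground-truth rank (best first).
    weights: (n,) per-position importance weights."""
    rev = torch.flip(scores, dims=[0])
    # log sum_{j >= i} exp(s_j), via a reversed cumulative logsumexp
    log_tail = torch.flip(torch.logcumsumexp(rev, dim=0), dims=[0])
    return (weights * (log_tail - scores)).sum()
```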

14.Deep Learning from Small Amount of Medical Data with Noisy Labels: A Meta-Learning Approach ⬇️

Computer vision systems recently made a big leap thanks to deep neural networks. However, these systems require correctly labeled large datasets in order to be trained properly, which is very difficult to obtain for medical applications. Two main reasons for label noise in medical applications are the high complexity of the data and conflicting opinions of experts. Moreover, medical imaging datasets are commonly tiny, which makes each sample very important in learning. As a result, if not handled properly, label noise significantly degrades performance. Therefore, we propose a label-noise-robust learning algorithm that makes use of the meta-learning paradigm. We tested our proposed solution on a retinopathy of prematurity (ROP) dataset with a very high label noise of 68%. Our results show that the proposed algorithm significantly improves the classification algorithm's performance in the presence of noisy labels.

15.PP-LinkNet: Improving Semantic Segmentation of High Resolution Satellite Imagery with Multi-stage Training ⬇️

Road network and building footprint extraction is essential for many applications such as updating maps, traffic regulation, city planning, ride-hailing, and disaster response. Mapping road networks is currently both expensive and labor-intensive. Recently, improvements in image segmentation through the application of deep neural networks have shown promising results in extracting road segments from large-scale, high-resolution satellite imagery. However, significant challenges remain due to the lack of labeled training data needed to build models for industry-grade applications. In this paper, we propose a two-stage transfer learning technique to improve the robustness of semantic segmentation for satellite images that leverages noisy pseudo ground-truth masks obtained automatically (without human labor) from crowd-sourced OpenStreetMap (OSM) data. We further propose Pyramid Pooling-LinkNet (PP-LinkNet), an improved deep neural network for segmentation that uses focal loss, a poly learning rate, and a context module. We demonstrate the strengths of our approach through evaluations on three popular datasets over two tasks, namely road extraction and building footprint detection. Specifically, we obtain 78.19% mean IoU on the SpaceNet building footprint dataset, and 67.03% and 77.11% on the road topology metric on the SpaceNet and DeepGlobe road extraction datasets, respectively.

16.AMPA-Net: Optimization-Inspired Attention Neural Network for Deep Compressed Sensing ⬇️

Compressed sensing (CS) is a challenging problem in image processing because an almost complete image must be reconstructed from a limited number of measurements. To achieve fast and accurate CS reconstruction, we synthesize the advantages of two well-known methods (neural networks and optimization algorithms) to propose a novel optimization-inspired neural network dubbed AMP-Net. AMP-Net realizes the fusion of the Approximate Message Passing (AMP) algorithm and neural networks. All of its parameters are learned automatically. Furthermore, we propose AMPA-Net, which uses three attention networks to improve the representation ability of AMP-Net. Finally, we demonstrate the effectiveness of AMP-Net and AMPA-Net on four CS reconstruction benchmark data sets.

17.Development of Open Informal Dataset Affecting Autonomous Driving ⬇️

This document describes the procedures and methods for collecting on-road objects and unstructured dynamic data for the development of object recognition technology for self-driving cars, and outlines the methods of data collection, annotation, object classifier criteria, and data processing. On-road objects and unstructured dynamic data were collected in various environments, covering different weather, time, and traffic conditions, with additional collection focused on police and safety personnel. In total, the constructed datasets comprise 100,000 images of pedestrians and various objects existing on roads, 200,000 images of police and traffic safety personnel, and two further sets of 5,000 images each.

18.Adaptive-Attentive Geolocalization from few queries: a hybrid approach ⬇️

We address the task of cross-domain visual place recognition, where the goal is to geolocalize a given query image against a labeled gallery, in the case where the query and the gallery belong to different visual domains. To achieve this, we focus on building a domain-robust deep network by leveraging an attention mechanism combined with few-shot unsupervised domain adaptation techniques, where we use a small number of unlabeled target-domain images to learn about the target distribution. With our method, we are able to outperform the current state of the art while using two orders of magnitude fewer target-domain images. Finally, we propose a new large-scale dataset for cross-domain visual place recognition, called SVOX. Upon acceptance of the paper, the code and dataset will be released.

19.Semantic Segmentation for Partially Occluded Apple Trees Based on Deep Learning ⬇️

Fruit tree pruning and fruit thinning require a powerful vision system that can provide high-resolution segmentation of the fruit trees and their branches. However, recent works either consider only the dormant season, where there are minimal occlusions on the branches, or fit a polynomial curve to reconstruct branch shape, thereby losing information about branch thickness. In this work, we apply two state-of-the-art supervised learning models, U-Net and DeepLabv3, and a conditional Generative Adversarial Network, Pix2Pix (with and without the discriminator), to segment partially occluded 2D-open-V apple trees. Binary accuracy, Mean IoU, Boundary F1 score, and Occluded branch recall were used to evaluate the performance of the models. DeepLabv3 outperforms the other models in Binary accuracy, Mean IoU, and Boundary F1 score, but is surpassed by Pix2Pix (without discriminator) and U-Net in Occluded branch recall. We define two difficulty indices to quantify the difficulty of the task: (1) the Occlusion Difficulty Index and (2) the Depth Difficulty Index. We analyze the worst 10 images under both difficulty indices by means of Branch Recall and Occluded Branch Recall. U-Net outperforms the other two models on the current metrics. On the other hand, Pix2Pix (without discriminator) recovers more information on branch paths, which is not reflected by the metrics. This highlights the need for more specific metrics on recovering occluded information. Furthermore, this shows the usefulness of image-transfer networks for hallucination behind occlusions. Future work is required to further enhance the models to recover more information from occlusions, so that this technology can be applied to automating agricultural tasks in a commercial environment.

20.Semantic Flow-guided Motion Removal Method for Robust Mapping ⬇️

Moving objects in scenes are still a severe challenge for SLAM systems. Many efforts have tried to remove motion regions in the images by detecting moving objects, so that the keypoints belonging to motion regions are ignored in later calculations. In this paper, we propose a novel motion removal method, leveraging semantic information and optical flow to extract motion regions. Different from previous works, we do not predict moving objects or motion regions directly from image sequences. Instead, we compute rigid optical flow, synthesized from the depth and pose, and compare it against the estimated optical flow to obtain initial motion regions. Then, we use K-means to fine-tune the motion region masks with instance segmentation masks. ORB-SLAM2 integrated with the proposed motion removal method achieves the best performance in both indoor and outdoor dynamic environments.
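
The residual-flow test can be sketched as follows (a simplified, NumPy-only reading of the abstract; K, R, t, and the threshold tau are placeholders): project each pixel with the depth and relative pose to obtain the rigid flow a static scene would produce, then flag pixels whose estimated flow disagrees.

```python
import numpy as np

def rigid_flow(depth, K, R, t):
    """Optical flow induced by camera motion alone, assuming a static scene."""
    h, w = depth.shape
    ys, xs = np.mgrid[0:h, 0:w]
    pix = np.stack([xs, ys, np.ones_like(xs)]).reshape(3, -1).astype(float)
    cam = (np.linalg.inv(K) @ pix) * depth.reshape(1, -1)   # back-project
    cam2 = R @ cam + t[:, None]                             # move to frame 2
    proj = K @ cam2
    proj = proj[:2] / proj[2:]                              # perspective divide
    return (proj - pix[:2]).T.reshape(h, w, 2)

def initial_motion_mask(est_flow, depth, K, R, t, tau=3.0):
    """est_flow: (h, w, 2) estimated optical flow."""
    residual = np.linalg.norm(est_flow - rigid_flow(depth, K, R, t), axis=-1)
    return residual > tau   # candidate motion regions, refined later by K-means
```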

21.Multi-Scale Networks for 3D Human Pose Estimation with Inference Stage Optimization ⬇️

Estimating 3D human poses from a monocular video is still a challenging task. The performance of many existing methods drops when the target person is occluded by other objects, or when the motion is too fast or slow relative to the scale and speed of the training data. Moreover, many of these methods are not explicitly designed or trained to handle severe occlusion, which compromises their performance. Addressing these problems, we introduce a spatio-temporal network for robust 3D human pose estimation. As humans in videos may appear at different scales and have various motion speeds, we apply multi-scale spatial features for 2D joint or keypoint prediction in each individual frame, and multi-stride temporal convolutional networks (TCNs) to estimate 3D joints or keypoints. Furthermore, we design a spatio-temporal discriminator based on body structures as well as limb motions to assess whether the predicted pose forms a valid pose and a valid movement. During training, we explicitly mask out some keypoints to simulate various occlusion cases, from minor to severe occlusion, so that our network can learn better and become robust to various degrees of occlusion. As there are limited 3D ground-truth data, we further utilize 2D video data to inject a semi-supervised learning capability into our network. Moreover, we observe that there is a discrepancy between 3D pose prediction and 2D pose estimation due to different pose variations between video and image training datasets. We therefore propose a confidence-based inference-stage optimization to adaptively enforce the 3D pose projection to match the 2D pose estimation, further improving final pose prediction accuracy. Experiments on public datasets validate the effectiveness of our method, and our ablation studies show the strengths of our network's individual submodules.

22.Towards Optimal Filter Pruning with Balanced Performance and Pruning Speed ⬇️

Filter pruning has drawn increasing attention since resource-constrained platforms require more compact models for deployment. However, current pruning methods suffer either from the inferior performance of one-shot methods or from the expensive time cost of iterative training methods. In this paper, we propose a filter pruning method balanced between performance and pruning speed. Based on a filter importance criterion, our method is able to prune a layer at an approximately layer-wise-optimal pruning rate for a preset loss variation. The network is pruned layer by layer, without the time-consuming prune-retrain iteration. If a pre-defined pruning rate for the entire network is given, we also introduce a method to find the corresponding loss-variation threshold with fast convergence. Moreover, we propose a layer-group pruning and channel-selection mechanism for channel alignment in networks with short connections. The proposed pruning method is widely applicable to common architectures and does not involve any additional training except the final fine-tuning. Comprehensive experiments show that our method outperforms many state-of-the-art approaches.
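
A minimal sketch of a layer-wise criterion of this kind (the paper's exact importance measure is not given in the abstract; the L1 norm is used here as a common stand-in, and eval_loss/base_loss/max_delta are hypothetical): keep zeroing the least important filters of a layer while the loss change stays within the preset budget.

```python
import torch

def l1_importance(conv_weight):
    """Per-filter L1 norm; conv_weight: (C_out, C_in, kH, kW)."""
    return conv_weight.abs().sum(dim=(1, 2, 3))

@torch.no_grad()
def prune_layer(conv_weight, eval_loss, base_loss, max_delta):
    """Zero filters in increasing order of importance while the loss
    increase stays within max_delta; eval_loss() re-evaluates the network."""
    for idx in torch.argsort(l1_importance(conv_weight)):
        saved = conv_weight[idx].clone()
        conv_weight[idx].zero_()
        if eval_loss() - base_loss > max_delta:
            conv_weight[idx].copy_(saved)   # revert this filter: budget exceeded
            break
```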

23.Ferrograph image classification ⬇️

Identifying ferrograph images is challenging given small datasets and the wide range of wear particle scales. A novel model is proposed in this study to cope with these problems. For the problem of insufficient samples, we first propose a data augmentation algorithm based on the permutation of image patches. Then, an auxiliary loss function for image patch permutation recognition is proposed to identify the images generated by the data augmentation algorithm. Moreover, we design a feature extraction loss function to force the proposed model to extract more abundant features and to reduce redundant representations. As for the large variation in wear particle size, we propose a multi-scale feature extraction block to obtain multi-scale representations of wear particles. We carried out experiments on a ferrograph image dataset and a mini-CIFAR-10 dataset. Experimental results show that the proposed model improves the accuracy on the two datasets by 9% and 20%, respectively, compared with the baseline.

24.Rotation Averaging with Attention Graph Neural Networks ⬇️

In this paper we propose a real-time and robust solution to large-scale multiple rotation averaging. Until recently, the multiple rotation averaging problem was solved using conventional iterative optimization algorithms. Such methods employed robust cost functions chosen based on assumptions made about the sensor noise and outlier distribution. In practice, these assumptions do not always fit real datasets well. A recent work showed that the noise distribution can be learned using a graph neural network; that solution required a second network for outlier detection and removal, as the averaging network was sensitive to poor initialization. In this paper we propose a single-stage graph neural network that can robustly perform rotation averaging in the presence of noise and outliers. Our method uses all observations, suppressing the effect of outliers through weighted averaging and an attention mechanism within the network design. The result is a network that is faster, more robust, and can be trained with fewer samples than the previous neural approach, ultimately outperforming conventional iterative algorithms in both accuracy and inference time.

25.Are all negatives created equal in contrastive instance discrimination? ⬇️

Self-supervised learning has recently begun to rival supervised learning on computer vision tasks. Many of the recent approaches have been based on contrastive instance discrimination (CID), in which the network is trained to recognize two augmented versions of the same instance (a query and positive) while discriminating against a pool of other instances (negatives). The learned representation is then used on downstream tasks such as image classification. Using methodology from MoCo v2 (Chen et al., 2020), we divided negatives by their difficulty for a given query and studied which difficulty ranges were most important for learning useful representations. We found a minority of negatives -- the hardest 5% -- were both necessary and sufficient for the downstream task to reach nearly full accuracy. Conversely, the easiest 95% of negatives were unnecessary and insufficient. Moreover, the very hardest 0.1% of negatives were unnecessary and sometimes detrimental. Finally, we studied the properties of negatives that affect their hardness, and found that hard negatives were more semantically similar to the query, and that some negatives were more consistently easy or hard than we would expect by chance. Together, our results indicate that negatives vary in importance and that CID may benefit from more intelligent negative treatment.
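
The difficulty split used in the study can be sketched directly (variable names are mine; embeddings are assumed L2-normalized): rank a query's negatives by similarity and slice out a percentile band, e.g. lo=0.95, hi=1.0 for the hardest 5%.

```python
import torch

def negatives_by_difficulty(query, negatives, lo=0.95, hi=1.0):
    """Return the slice of negatives between the lo and hi difficulty
    percentiles, where 1.0 means hardest (most similar to the query)."""
    sims = negatives @ query              # cosine similarity to the query
    order = torch.argsort(sims)           # easiest (least similar) first
    n = negatives.shape[0]
    return negatives[order[int(lo * n):int(hi * n)]]
```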

26.Intrapersonal Parameter Optimization for Offline Handwritten Signature Augmentation ⬇️

Usually, in a real-world scenario, few signature samples are available to train an automatic signature verification system (ASVS). However, such systems do indeed need a lot of signatures to achieve an acceptable performance. Neuromotor signature duplication methods and feature space augmentation methods may be used to meet the need for an increase in the number of samples. Such techniques manually or empirically define a set of parameters to introduce a degree of writer variability. Therefore, in the present study, a method to automatically model the most common writer variability traits is proposed. The method is used to generate offline signatures in the image and the feature space and train an ASVS. We also introduce an alternative approach to evaluate the quality of samples considering their feature vectors. We evaluated the performance of an ASVS with the generated samples using three well-known offline signature datasets: GPDS, MCYT-75, and CEDAR. In GPDS-300, when the SVM classifier was trained using one genuine signature per writer and the duplicates generated in the image space, the Equal Error Rate (EER) decreased from 5.71% to 1.08%. Under the same conditions, the EER decreased to 1.04% using the feature space augmentation technique. We also verified that the model that generates duplicates in the image space reproduces the most common writer variability traits in the three different datasets.

27.Video Action Understanding: A Tutorial ⬇️

Many believe that the successes of deep learning on image understanding problems can be replicated in the realm of video understanding. However, the span of video action problems and the set of proposed deep learning solutions is arguably wider and more diverse than those of their 2D image siblings. Finding, identifying, and predicting actions are a few of the most salient tasks in video action understanding. This tutorial clarifies a taxonomy of video action problems, highlights datasets and metrics used to baseline each problem, describes common data preparation methods, and presents the building blocks of state-of-the-art deep learning model architectures.

28.On Deep Learning Techniques to Boost Monocular Depth Estimation for Autonomous Navigation ⬇️

Inferring the depth of images is a fundamental inverse problem within the field of Computer Vision, since depth information is obtained through 2D images, which can be generated from infinite possibilities of observed real scenes. Benefiting from the progress of Convolutional Neural Networks (CNNs) in exploring structural features and spatial image information, Single Image Depth Estimation (SIDE) is often highlighted in scopes of scientific and technological innovation, as this concept provides advantages related to its low implementation cost and robustness to environmental conditions. In the context of autonomous vehicles, state-of-the-art CNNs optimize the SIDE task by producing high-quality depth maps, which are essential during the autonomous navigation process in different locations. However, such networks are usually supervised by sparse and noisy depth data from Light Detection and Ranging (LiDAR) laser scans, and run at high computational cost, requiring high-performance Graphics Processing Units (GPUs). Therefore, we propose a new lightweight and fast supervised CNN architecture, combined with novel feature extraction models, designed for real-world autonomous navigation. We also introduce an efficient surface normals module, jointly with a simple geometric 2.5D loss function, to solve SIDE problems. We further innovate by incorporating multiple Deep Learning techniques, such as the use of densification algorithms and additional semantic, surface-normal, and depth information to train our framework. The method introduced in this work focuses on robotic applications in indoor and outdoor environments, and its results are evaluated on the competitive and publicly available NYU Depth V2 and KITTI Depth datasets.

29.A spatial model checker in GPU (extended version) ⬇️

The tool voxlogica merges the state-of-the-art library of computational imaging algorithms ITK with the combination of declarative specification and optimised execution provided by spatial logic model checking. The analysis of an existing benchmark for segmentation of brain tumours via a simple logical specification reached state-of-the-art accuracy. We present a new, GPU-based version of voxlogica and discuss its implementation, scalability, and applications.

30.Privacy-Preserving Object Detection & Localization Using Distributed Machine Learning: A Case Study of Infant Eyeblink Conditioning ⬇️

Distributed machine learning is becoming a popular model-training method due to privacy, computational scalability, and bandwidth capacities. In this work, we explore scalable distributed-training versions of two algorithms commonly used in object detection. A novel distributed training algorithm using Mean Weight Matrix Aggregation (MWMA) is proposed for Linear Support Vector Machine (L-SVM) object detection based on Histogram of Oriented Gradients (HOG). In addition, a novel Weighted Bin Aggregation (WBA) algorithm is proposed for distributed training of Ensemble of Regression Trees (ERT) landmark localization. Neither algorithm restricts the location of model aggregation, and both allow custom architectures for model distribution. For this work, a Pool-Based Local Training and Aggregation (PBLTA) architecture for both algorithms is explored. The application of both algorithms in the medical field is examined using a paradigm from the fields of psychology and neuroscience - eyeblink conditioning with infants - where models need to be trained on facial images while protecting participant privacy. Using distributed learning, models can be trained without sending image data to other nodes. The custom software has been made available for public use on GitHub: this https URL. Results show that the aggregation of models for the HOG algorithm using MWMA not only preserves the accuracy of the model but also allows for distributed learning with an accuracy increase of 0.9% compared with traditional learning. Furthermore, WBA allows for ERT model aggregation with an accuracy increase of 8% when compared to single-node models.

31.Fader Networks for domain adaptation on fMRI: ABIDE-II study ⬇️

ABIDE is the largest open-source autism spectrum disorder database with both fMRI data and full phenotype descriptions. These data have been extensively studied based on functional connectivity analysis as well as with deep learning on raw data, with top models' accuracy close to 75% for separate scanning sites. Yet there is still the problem of model transferability between different scanning sites within ABIDE. In the current paper, we perform, for the first time, domain adaptation for the brain pathology classification problem on raw neuroimaging data. We use 3D convolutional autoencoders to build a domain-irrelevant latent space image representation and demonstrate that this method outperforms existing approaches on ABIDE data.

32.Domain Shift in Computer Vision models for MRI data analysis: An Overview ⬇️

Machine learning and computer vision methods are showing good performance in medical imagery analysis. Yet only a few applications are now in clinical use, and one of the reasons for that is poor transferability of the models to data from different sources or acquisition domains. Development of new methods and algorithms for the transfer of training and adaptation of the domain in multi-modal medical imaging data is crucial for the development of accurate models and their use in clinics. In the present work, we overview methods used to tackle the domain shift problem in machine learning and computer vision. The algorithms discussed in this survey include advanced data processing, model architecture enhancing, and featured training, as well as predicting in a domain-invariant latent space. The application of autoencoding neural networks and their domain-invariant variations is heavily discussed in the survey. We observe the latest methods applied to magnetic resonance imaging (MRI) data analysis and conclude on their performance, as well as propose directions for further research.

33.Data Augmentation for Meta-Learning ⬇️

Conventional image classifiers are trained by randomly sampling mini-batches of images. To achieve state-of-the-art performance, sophisticated data augmentation schemes are used to expand the amount of training data available for sampling. In contrast, meta-learning algorithms sample not only images, but classes as well. We investigate how data augmentation can be used not only to expand the number of images available per class, but also to generate entirely new classes. We systematically dissect the meta-learning pipeline and investigate the distinct ways in which data augmentation can be integrated at both the image and class levels. Our proposed meta-specific data augmentation significantly improves the performance of meta-learners on few-shot classification benchmarks.

34.3D Segmentation Networks for Excessive Numbers of Classes: Distinct Bone Segmentation in Upper Bodies ⬇️

Segmentation of distinct bones plays a crucial role in diagnosis, planning, navigation, and the assessment of bone metastasis. It supplies semantic knowledge to visualisation tools for the planning of surgical interventions and the education of health professionals. Fully supervised segmentation of 3D data using Deep Learning methods has been extensively studied for many tasks but is usually restricted to distinguishing only a handful of classes. With 125 distinct bones, our case includes many more labels than typical 3D segmentation tasks. For this reason, the direct adaptation of most established methods is not possible. This paper discusses the intricacies of training a 3D segmentation network in a many-label setting and shows necessary modifications in network architecture, loss function, and data augmentation. As a result, we demonstrate the robustness of our method by automatically segmenting over one hundred distinct bones simultaneously in an end-to-end learnt fashion from a CT-scan.

35.Understanding bias in facial recognition technologies ⬇️

Over the past couple of years, the growing debate around automated facial recognition has reached a boiling point. As developers have continued to swiftly expand the scope of these kinds of technologies into an almost unbounded range of applications, an increasingly strident chorus of critical voices has sounded concerns about the injurious effects of the proliferation of such systems. Opponents argue that the irresponsible design and use of facial detection and recognition technologies (FDRTs) threatens to violate civil liberties, infringe on basic human rights and further entrench structural racism and systemic marginalisation. They also caution that the gradual creep of face surveillance infrastructures into every domain of lived experience may eventually eradicate the modern democratic forms of life that have long provided cherished means to individual flourishing, social solidarity and human self-creation. Defenders, by contrast, emphasise the gains in public safety, security and efficiency that digitally streamlined capacities for facial identification, identity verification and trait characterisation may bring. In this explainer, I focus on one central aspect of this debate: the role that dynamics of bias and discrimination play in the development and deployment of FDRTs. I examine how historical patterns of discrimination have made inroads into the design and implementation of FDRTs from their very earliest moments. I also explain the ways in which the use of biased FDRTs can lead to distributional and recognitional injustices. The explainer concludes with an exploration of broader ethical questions around the potential proliferation of pervasive face-based surveillance infrastructures and makes some recommendations for cultivating more responsible approaches to the development and governance of these technologies.

36.Fast meningioma segmentation in T1-weighted MRI volumes using a lightweight 3D deep learning architecture ⬇️

Automatic and consistent meningioma segmentation in T1-weighted MRI volumes and corresponding volumetric assessment is of use for diagnosis, treatment planning, and tumor growth evaluation. In this paper, we optimized the segmentation and processing speed performances using a large number of both surgically treated meningiomas and untreated meningiomas followed at the outpatient clinic. We studied two different 3D neural network architectures: (i) a simple encoder-decoder similar to a 3D U-Net, and (ii) a lightweight multi-scale architecture (PLS-Net). In addition, we studied the impact of different training schemes. For the validation studies, we used 698 T1-weighted MR volumes from St. Olav University Hospital, Trondheim, Norway. The models were evaluated in terms of detection accuracy, segmentation accuracy and training/inference speed. While both architectures reached a similar Dice score of 70% on average, the PLS-Net was more accurate with an F1-score of up to 88%. The highest accuracy was achieved for the largest meningiomas. Speed-wise, the PLS-Net architecture tended to converge in about 50 hours while 130 hours were necessary for U-Net. Inference with PLS-Net takes less than a second on GPU and about 15 seconds on CPU. Overall, with the use of mixed precision training, it was possible to train competitive segmentation models in a relatively short amount of time using the lightweight PLS-Net architecture. In the future, the focus should be brought toward the segmentation of small meningiomas (less than 2ml) to improve clinical relevance for automatic and early diagnosis as well as speed of growth estimates.

37.Using satellite imagery to understand and promote sustainable development ⬇️

Accurate and comprehensive measurements of a range of sustainable development outcomes are fundamental inputs into both research and policy. We synthesize the growing literature that uses satellite imagery to understand these outcomes, with a focus on approaches that combine imagery with machine learning. We quantify the paucity of ground data on key human-related outcomes and the growing abundance and resolution (spatial, temporal, and spectral) of satellite imagery. We then review recent machine learning approaches to model-building in the context of scarce and noisy training data, highlighting how this noise often leads to incorrect assessment of models' predictive performance. We quantify recent model performance across multiple sustainable development domains, discuss research and policy applications, explore constraints to future progress, and highlight key research directions for the field.

38.Practical Deep Raw Image Denoising on Mobile Devices ⬇️

Deep learning-based image denoising approaches have been extensively studied in recent years, prevailing in many public benchmark datasets. However, state-of-the-art networks are computationally too expensive to be directly applied on mobile devices. In this work, we propose a lightweight, efficient neural network-based raw image denoiser that runs smoothly on mainstream mobile devices and produces high-quality denoising results. Our key insights are twofold: (1) by measuring and estimating sensor noise level, a smaller network trained on synthetic sensor-specific data can outperform larger ones trained on general data; (2) the large noise level variation under different ISO settings can be removed by a novel k-Sigma Transform, allowing a small network to efficiently handle a wide range of noise levels. We conduct extensive experiments to demonstrate the efficiency and accuracy of our approach. Our proposed mobile-friendly denoising model runs at ~70 milliseconds per megapixel on the Qualcomm Snapdragon 855 chipset, and it is the basis of the night shot feature of several flagship smartphones released in 2019.
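
For a Poisson-Gaussian sensor model where Var(x) ≈ k·x + σ², one transform with the property the abstract describes is f(x) = x/k + σ²/k², since after it Var(y) ≈ E[y] for any (k, σ), i.e. the noise looks like unit-gain Poisson noise regardless of ISO. Treat the sketch below as indicative of the idea rather than the paper's confirmed definition:

```python
def k_sigma(x, k, sigma):
    """Map raw intensities so noise variance is ~ISO-independent."""
    return x / k + (sigma / k) ** 2

def k_sigma_inverse(y, k, sigma):
    """Undo the transform after the network has denoised y."""
    return (y - (sigma / k) ** 2) * k
```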

39.Efficient and high accuracy 3-D OCT angiography motion correction in pathology ⬇️

We propose a novel method for non-rigid 3-D motion correction of orthogonally raster-scanned optical coherence tomography angiography volumes. This is the first approach that aligns predominantly axial structural features, like retinal layers, and transverse angiographic vascular features in a joint optimization. Combined with the use of orthogonal scans and the favoring of kinematically more plausible displacements, the approach allows subpixel alignment and micrometer-scale distortion correction in all 3 dimensions. As no specific structures or layers are segmented, the approach is by design robust to pathologic changes. It is furthermore designed for highly parallel implementation and brief runtime, allowing its integration into clinical routine even for high-density or wide-field scans. We evaluated the algorithm with metrics related to clinically relevant features in a large-scale quantitative evaluation based on 204 volumetric scans of 17 subjects, including both a wide range of pathologies and healthy controls. Using this method, we achieve state-of-the-art axial performance and show significant advances in both transverse co-alignment and distortion correction, especially in the pathologic subgroup.

40.Identifying Wrongly Predicted Samples: A Method for Active Learning ⬇️

State-of-the-art machine learning models require access to a significant amount of annotated data in order to achieve the desired level of performance. While unlabelled data can be largely available and even abundant, the annotation process can be quite expensive and limiting. Under the assumption that some samples are more important for a given task than others, active learning targets the problem of identifying the most informative samples that one should acquire annotations for. Instead of the conventional reliance on model uncertainty as a proxy to leverage new unknown labels, in this work we propose a simple sample selection criterion that moves beyond uncertainty. By first accepting the model prediction and then judging its effect on the generalization error, we can better identify wrongly predicted samples. We further present an approximation to our criterion that is very efficient and provides a similarity-based interpretation. In addition to evaluating our method on the standard benchmarks of active learning, we consider the challenging yet realistic scenario of imbalanced data where categories are not equally represented. We show state-of-the-art results and better rates at identifying wrongly predicted samples. Our method is simple, model agnostic, and relies on the current model status without the need for re-training from scratch.

41.Deep Ensembles for Low-Data Transfer Learning ⬇️

In the low-data regime, it is difficult to train good supervised models from scratch. Instead practitioners turn to pre-trained models, leveraging transfer learning. Ensembling is an empirically and theoretically appealing way to construct powerful predictive models, but the predominant approach of training multiple deep networks with different random initialisations collides with the need for transfer via pre-trained weights. In this work, we study different ways of creating ensembles from pre-trained models. We show that the nature of pre-training itself is a performant source of diversity, and propose a practical algorithm that efficiently identifies a subset of pre-trained models for any downstream dataset. The approach is simple: Use nearest-neighbour accuracy to rank pre-trained models, fine-tune the best ones with a small hyperparameter sweep, and greedily construct an ensemble to minimise validation cross-entropy. When evaluated together with strong baselines on 19 different downstream tasks (the Visual Task Adaptation Benchmark), this achieves state-of-the-art performance at a much lower inference budget, even when selecting from over 2,000 pre-trained models. We also assess our ensembles on ImageNet variants and show improved robustness to distribution shift.
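
The greedy construction step is easy to make concrete (a sketch; probs and labels are hypothetical validation outputs of the already fine-tuned candidates): repeatedly add whichever model most lowers the ensemble's validation cross-entropy, allowing repeats, and stop when no addition helps.

```python
import numpy as np

def greedy_ensemble(probs, labels, max_size=5):
    """probs: list of (N, C) validation softmax outputs, one per candidate model.
    labels: (N,) integer class labels. Returns chosen indices and final NLL."""
    chosen, best = [], None
    for _ in range(max_size):
        scores = []
        for p in probs:
            members = [probs[j] for j in chosen] + [p]
            avg = np.mean(members, axis=0)       # ensemble by probability averaging
            nll = -np.log(avg[np.arange(len(labels)), labels] + 1e-12).mean()
            scores.append(nll)
        i_best = int(np.argmin(scores))
        if best is not None and scores[i_best] >= best:
            break                                # no candidate improves the ensemble
        best = scores[i_best]
        chosen.append(i_best)
    return chosen, best
```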

42.GreedyFool: An Imperceptible Black-box Adversarial Example Attack against Neural Networks ⬇️

Deep neural networks (DNNs) are inherently vulnerable to well-designed input samples called adversarial examples. The adversary can easily fool DNNs by adding slight perturbations to the input. In this paper, we propose a novel black-box adversarial example attack named GreedyFool, which synthesizes adversarial examples based on differential evolution and greedy approximation. Differential evolution is utilized to evaluate the effects of perturbed pixels on the confidence of the DNN-based classifier, and the greedy approximation is an approximate optimization algorithm that automatically derives adversarial perturbations. Existing works synthesize adversarial examples by leveraging simple metrics to penalize perturbations, which lack sufficient consideration of the human visual system (HVS), resulting in noticeable artifacts. To achieve sufficient imperceptibility, we conduct extensive investigations into the HVS and design an integrated metric considering just-noticeable distortion (JND), the Weber-Fechner law, texture masking, and channel modulation, which is shown to be a better metric for measuring the perceptual distance between benign examples and adversarial ones. The experimental results demonstrate that GreedyFool has several remarkable properties, including black-box operation, a 100% success rate, flexibility, and automation, and that it synthesizes more imperceptible adversarial examples than state-of-the-art pixel-wise methods.

43.Differential diagnosis and molecular stratification of gastrointestinal stromal tumors on CT images using a radiomics approach ⬇️

Distinguishing gastrointestinal stromal tumors (GISTs) from other intra-abdominal tumors, as well as molecular analysis of GISTs, is necessary for treatment planning, but challenging due to the tumor's rarity. The aim of this study was to evaluate radiomics for distinguishing GISTs from other intra-abdominal tumors and, in GISTs, for predicting the c-KIT, PDGFRA, and BRAF mutational status and mitotic index (MI). All 247 included patients (125 GISTs, 122 non-GISTs) underwent a contrast-enhanced venous-phase CT. The GIST vs. non-GIST radiomics model, including imaging, age, sex, and location, had a mean area under the curve (AUC) of 0.82. Three radiologists had AUCs of 0.69, 0.76, and 0.84, respectively. The radiomics model had an AUC of 0.52 for c-KIT, 0.56 for c-KIT exon 11, and 0.52 for the MI. Hence, our radiomics model was able to distinguish GISTs from non-GISTs with a performance similar to three radiologists, but was not able to predict the c-KIT mutation or MI.

44.Just Pick a Sign: Optimizing Deep Multitask Models with Gradient Sign Dropout ⬇️

The vast majority of deep models use multiple gradient signals, typically corresponding to a sum of multiple loss terms, to update a shared set of trainable weights. However, these multiple updates can impede optimal training by pulling the model in conflicting directions. We present Gradient Sign Dropout (GradDrop), a probabilistic masking procedure which samples gradients at an activation layer based on their level of consistency. GradDrop is implemented as a simple deep layer that can be used in any deep net and synergizes with other gradient balancing approaches. We show that GradDrop outperforms the state-of-the-art multiloss methods within traditional multitask and transfer learning settings, and we discuss how GradDrop reveals links between optimal multiloss training and gradient stochasticity.
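
As described, GradDrop keeps, per unit, either the positive or the negative gradient components, with a probability tied to their sign consistency across losses. A minimal sketch of the masking rule (layer integration and any curvature scaling are omitted; treat it as an illustration, not the paper's reference implementation):

```python
import torch

def grad_drop(grads, eps=1e-8):
    """grads: list of same-shape gradient tensors, one per loss term."""
    g = torch.stack(grads)                               # (T, ...)
    # Probability of keeping the positive direction: 1 when all loss
    # gradients agree positively, 0 when all agree negatively.
    p_pos = 0.5 * (1.0 + g.sum(dim=0) / (g.abs().sum(dim=0) + eps))
    keep_pos = torch.rand_like(p_pos) < p_pos
    mask = keep_pos & (g > 0) | ~keep_pos & (g < 0)      # drop the losing sign
    return (g * mask).sum(dim=0)                         # combined masked gradient
```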

45.Low-rank Convex/Sparse Thermal Matrix Approximation for Infrared-based Diagnostic System ⬇️

Active and passive thermography are two efficient techniques extensively used to measure heterogeneous thermal patterns indicating subsurface defects for diagnostic evaluation. This study conducts a comparative analysis of low-rank matrix approximation methods in thermography, with applications of semi-, convex-, and sparse non-negative matrix factorization (NMF) methods for detecting subsurface thermal patterns. These methods inherit the advantages of principal component thermography (PCT) and sparse PCT, while tackling the negative bases of sparse PCT through non-negative constraints, and exhibit a clustering property in processing data. The practicality and efficiency of these methods are demonstrated by experimental results for subsurface defect detection in three specimens (with defects of different depths and sizes) and for preserving thermal heterogeneity to distinguish breast abnormality in a breast cancer screening dataset (accuracies of 74.1%, 75.8%, and 77.8%).

46.Vokenization: Improving Language Understanding with Contextualized, Visual-Grounded Supervision ⬇️

Humans learn language by listening, speaking, writing, reading, and also via interaction with the multimodal real world. Existing language pre-training frameworks show the effectiveness of text-only self-supervision, while in this paper we explore the idea of a visually-supervised language model. We find that the main obstacle to this exploration is the large divergence in magnitude and distribution between visually-grounded language datasets and pure-language corpora. Therefore, we develop a technique named "vokenization" that extrapolates multimodal alignments to language-only data by contextually mapping language tokens to their related images (which we call "vokens"). The "vokenizer" is trained on relatively small image captioning datasets, and we then apply it to generate vokens for large language corpora. Trained with these contextually generated vokens, our visually-supervised language models show consistent improvements over self-supervised alternatives on multiple pure-language tasks such as GLUE, SQuAD, and SWAG. Code and pre-trained models are publicly available at this https URL.

47.Measuring Visual Generalization in Continuous Control from Pixels ⬇️

Self-supervised learning and data augmentation have significantly reduced the performance gap between state- and image-based reinforcement learning agents in continuous control tasks. However, it is still unclear whether current techniques can handle the variety of visual conditions required by real-world environments. We propose a challenging benchmark that tests agents' visual generalization by adding graphical variety to existing continuous control domains. Our empirical analysis shows that current methods struggle to generalize across a diverse set of visual changes, and we examine the specific factors of variation that make these tasks difficult. We find that data augmentation techniques outperform self-supervised learning approaches and that more significant image transformations provide better visual generalization. (The benchmark and our augmented actor-critic implementation are open-sourced at this https URL.)

48.Random Network Distillation as a Diversity Metric for Both Image and Text Generation ⬇️

Generative models are increasingly able to produce remarkably high quality images and text. The community has developed numerous evaluation metrics for comparing generative models. However, these metrics do not effectively quantify data diversity. We develop a new diversity metric that can readily be applied to data, both synthetic and natural, of any type. Our method employs random network distillation, a technique introduced in reinforcement learning. We validate and deploy this metric on both images and text. We further explore diversity in few-shot image generation, a setting which was previously difficult to evaluate.
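
One way to instantiate the idea (a simplified sketch of RND-style novelty scoring; the paper's exact protocol, architectures, and training length may differ): distill a fixed random network on part of the samples, then read a diversity signal off the distillation error on the rest, since a diverse set is harder to distill from a subset.

```python
import torch
import torch.nn as nn

def rnd_diversity(samples, feat_dim=128, steps=200, lr=1e-3):
    """samples: (N, D) float tensor. Returns a novelty-based diversity score."""
    d = samples.shape[1]
    target = nn.Sequential(nn.Linear(d, feat_dim), nn.ReLU(),
                           nn.Linear(feat_dim, feat_dim))
    for p in target.parameters():
        p.requires_grad_(False)                 # the random target net stays fixed
    predictor = nn.Sequential(nn.Linear(d, feat_dim), nn.ReLU(),
                              nn.Linear(feat_dim, feat_dim))
    opt = torch.optim.Adam(predictor.parameters(), lr=lr)
    half = samples.shape[0] // 2
    train, held_out = samples[:half], samples[half:]
    for _ in range(steps):                      # distill the random net on one half
        loss = ((predictor(train) - target(train)) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    with torch.no_grad():                       # novelty = distillation error
        novelty = ((predictor(held_out) - target(held_out)) ** 2).mean()
    return novelty.item()                       # higher suggests more diversity
```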

49.Handwriting Quality Analysis using Online-Offline Models ⬇️

This work is part of an innovative e-learning project enabling the development of an advanced digital educational tool that provides feedback during the process of learning handwriting for young school children (three to eight years old). In this paper, we describe a new method for analyzing the quality of children's handwriting. It automatically detects mistakes, gives real-time online feedback on children's writing, and helps teachers comprehend and evaluate children's writing skills. The proposed method assesses five main criteria: shape, direction, stroke order, position with respect to the reference lines, and kinematics of the trace. It analyzes handwriting quality and automatically gives feedback based on the combination of three extracted models: a Beta-Elliptic Model (BEM) using similarity detection (SD) and a dissimilarity distance (DD) measure, a Fourier Descriptor Model (FDM), and a perceptive Convolutional Neural Network (CNN) with a Support Vector Machine (SVM) comparison engine. The originality of our work lies partly in the system architecture, which apprehends complementary dynamic, geometric, and visual representations of the examined handwritten scripts, and in the efficient selected features, adapted to various handwriting styles and multiple script languages such as Arabic, Latin, digits, and symbol drawing. The application offers two interactive interfaces, dedicated respectively to learners and to educators, experts, or teachers, and allows them to adapt it easily to the specificity of their pupils. The evaluation of our framework is enhanced by a database collected in Tunisian primary schools from 400 children. Experimental results show the efficiency and robustness of our suggested framework, which helps teachers and children by offering positive feedback throughout the handwriting learning process using tactile digital devices.

50.A Multi-Modal Method for Satire Detection using Textual and Visual Cues ⬇️

Satire is a form of humorous critique, but it is sometimes misinterpreted by readers as legitimate news, which can lead to harmful consequences. We observe that the images used in satirical news articles often contain absurd or ridiculous content and that image manipulation is used to create fictional scenarios. While previous work has studied text-based methods, in this work we propose a multi-modal approach based on the state-of-the-art visiolinguistic model ViLBERT. To this end, we create a new dataset consisting of images and headlines of regular and satirical news for the task of satire detection. We fine-tune ViLBERT on the dataset and train a convolutional neural network that uses an image forensics technique. Evaluation on the dataset shows that our proposed multi-modal approach outperforms image-only, text-only, and simple fusion baselines.

51.LiDAM: Semi-Supervised Learning with Localized Domain Adaptation and Iterative Matching ⬇️

Although data is abundant, data labeling is expensive. Semi-supervised learning methods combine a few labeled samples with a large corpus of unlabeled data to effectively train models. This paper introduces our proposed method LiDAM, a semi-supervised learning approach rooted in both domain adaptation and self-paced learning. LiDAM first performs localized domain shifts to extract better domain-invariant features for the model that results in more accurate clusters and pseudo-labels. These pseudo-labels are then aligned with real class labels in a self-paced fashion using a novel iterative matching technique that is based on majority consistency over high-confidence predictions. Simultaneously, a final classifier is trained to predict ground-truth labels until convergence. LiDAM achieves state-of-the-art performance on the CIFAR-100 dataset, outperforming FixMatch (73.50% vs. 71.82%) when using 2500 labels.

52.Training independent subnetworks for robust prediction ⬇️

Recent approaches to efficiently ensemble neural networks have shown that strong robustness and uncertainty performance can be achieved with a negligible gain in parameters over the original network. However, these methods still require multiple forward passes for prediction, leading to a significant computational cost. In this work, we show a surprising result: the benefits of using multiple predictions can be achieved `for free' under a single model's forward pass. In particular, we show that, using a multi-input multi-output (MIMO) configuration, one can utilize a single model's capacity to train multiple subnetworks that independently learn the task at hand. By ensembling the predictions made by the subnetworks, we improve model robustness without increasing compute. We observe a significant improvement in negative log-likelihood, accuracy, and calibration error on CIFAR10, CIFAR100, ImageNet, and their out-of-distribution variants compared to previous methods.
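
The MIMO configuration can be sketched as follows (a toy model, not the paper's architecture): the network ingests M images stacked along the channel dimension and emits M independent predictions; at test time the same image is repeated M times and the M heads are averaged, yielding an ensemble in one forward pass.

```python
import torch
import torch.nn as nn

class MIMONet(nn.Module):
    """Toy multi-input multi-output classifier with m implicit subnetworks."""
    def __init__(self, m=3, n_classes=10):
        super().__init__()
        self.m = m
        self.body = nn.Sequential(
            nn.Conv2d(3 * m, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.head = nn.Linear(64, m * n_classes)

    def forward(self, x):              # x: (B, m, 3, H, W), independent images
        logits = self.head(self.body(x.flatten(1, 2)))
        return logits.view(x.shape[0], self.m, -1)   # one prediction per input

def predict(net, x):
    """Inference: tile one image (B, 3, H, W) m times, average the heads."""
    tiled = x.unsqueeze(1).expand(-1, net.m, -1, -1, -1)
    return net(tiled).softmax(dim=-1).mean(dim=1)
```

During training, each of the m slots receives a different example with its own label, which is what forces the subnetworks to become independent.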