
# ArXiv cs.CV - Thu, 25 Oct 2018

1.Spatiotemporal CNNs for Pornography Detection in Videos pdf

With the increasing use of social networks and mobile devices, the number of videos posted on the Internet is growing exponentially. Among the inappropriate content published on the Internet, pornography is one of the most worrying, as it can be accessed by teens and children. In the present study, two spatiotemporal CNNs, VGG-C3D CNN and ResNet R(2+1)D CNN, were assessed for pornography detection in videos. Experimental results on the Pornography-800 dataset showed that these spatiotemporal CNNs perform better than some state-of-the-art methods based on bags of visual words and are competitive with other CNN-based approaches, reaching an accuracy of 95.1%.

2.Neighbourhood Consensus Networks pdf

We address the problem of finding reliable dense correspondences between a pair of images. This is a challenging task due to strong appearance differences between the corresponding scene elements and ambiguities generated by repetitive patterns. The contributions of this work are threefold. First, inspired by the classic idea of disambiguating feature matches using semi-local constraints, we develop an end-to-end trainable convolutional neural network architecture that identifies sets of spatially consistent matches by analyzing neighbourhood consensus patterns in the 4D space of all possible correspondences between a pair of images, without the need for a global geometric model. Second, we demonstrate that the model can be trained effectively from weak supervision in the form of matching and non-matching image pairs, without the need for costly manual annotation of point-to-point correspondences. Third, we show that the proposed neighbourhood consensus network can be applied to a range of matching tasks, including both category- and instance-level matching, obtaining state-of-the-art results on the PF Pascal dataset and the InLoc indoor visual localization benchmark.

3.The UAVid Dataset for Video Semantic Segmentation pdf

Video semantic segmentation has recently been one of the research focuses in computer vision. It serves as a perception foundation for many fields, such as robotics and autonomous driving. The fast development of semantic segmentation owes much to large-scale datasets, especially for deep learning related methods. Several semantic segmentation datasets for complex urban scenes already exist, such as the Cityscapes and CamVid datasets, and they have become the standard datasets for comparison among semantic segmentation methods. In this paper, we introduce a new high-resolution UAV video semantic segmentation dataset as a complement: UAVid. Our UAV dataset consists of 30 video sequences capturing high-resolution images. In total, 300 images have been densely labelled with 8 classes for the urban scene understanding task. Our dataset brings out new challenges. We provide several deep learning baseline methods, among which the proposed novel Multi-Scale-Dilation net performs best via multi-scale feature extraction. We have also explored the usability of the sequence data by leveraging a CRF model in both the spatial and temporal domains.

4.LoGAN: Generating Logos with a Generative Adversarial Neural Network Conditioned on color pdf

Designing a logo is a long, complicated, and expensive process for any designer. However, recent advancements in generative algorithms provide models that could offer a possible solution. Logos are multi-modal, have very few categorical properties, and do not have a continuous latent space. Yet, conditional generative adversarial networks can be used to generate logos that could help designers in their creative process. We propose LoGAN: an improved auxiliary classifier Wasserstein generative adversarial neural network (with gradient penalty) that is able to generate logos conditioned on twelve different colors. Across 768 generated instances (12 classes and 64 logos per class), judged by the most prominent color, the conditional generation part of the model has an overall precision and recall of 0.8 and 0.7, respectively. LoGAN's results offer a first glance at how artificial intelligence can be used to assist designers in their creative process and open promising future directions, such as including more descriptive labels, which would provide a more exhaustive and easy-to-use system.
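
As a rough illustration of the conditioning mechanism only (the layer sizes, the label embedding, and the DCGAN-style stack below are assumptions, not LoGAN's published AC-WGAN-GP architecture), a color-conditioned generator can be wired up like this:

```python
# Minimal sketch (not the paper's architecture): a DCGAN-style generator
# conditioned on one of 12 color classes via a learned label embedding.
import torch
import torch.nn as nn

class ColorConditionedGenerator(nn.Module):
    def __init__(self, z_dim=100, n_colors=12, emb_dim=16, img_channels=3):
        super().__init__()
        self.embed = nn.Embedding(n_colors, emb_dim)  # color label -> vector
        self.net = nn.Sequential(
            # project the concatenated (noise, label) vector to 4x4 feature maps
            nn.ConvTranspose2d(z_dim + emb_dim, 256, 4, 1, 0), nn.BatchNorm2d(256), nn.ReLU(True),
            nn.ConvTranspose2d(256, 128, 4, 2, 1), nn.BatchNorm2d(128), nn.ReLU(True),
            nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.BatchNorm2d(64), nn.ReLU(True),
            nn.ConvTranspose2d(64, img_channels, 4, 2, 1), nn.Tanh(),  # 32x32 logo
        )

    def forward(self, z, color_label):
        c = self.embed(color_label)                    # (B, emb_dim)
        x = torch.cat([z, c], dim=1)[..., None, None]  # (B, z+emb, 1, 1)
        return self.net(x)

g = ColorConditionedGenerator()
fake = g(torch.randn(4, 100), torch.tensor([0, 3, 7, 11]))  # 4 logos, 4 colors
print(fake.shape)  # torch.Size([4, 3, 32, 32])
```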

5.Implicit Modeling with Uncertainty Estimation for Intravoxel Incoherent Motion Imaging pdf

Intravoxel incoherent motion (IVIM) imaging allows contrast-agent-free in vivo perfusion quantification with magnetic resonance imaging (MRI). However, its use is limited by typically low accuracy due to a low signal-to-noise ratio (SNR) at large gradient encoding magnitudes, as well as dephasing artefacts caused by subject motion, which is particularly challenging in fetal MRI. To mitigate this problem, we propose an implicit IVIM signal acquisition model with which we learn the full posterior distribution of perfusion parameters using artificial neural networks. This posterior then encapsulates the uncertainty of the inferred parameter estimates, which we validate herein via numerical experiments with rejection-based Bayesian sampling. Compared to the state-of-the-art IVIM estimation method of segmented least-squares fitting, our proposed approach improves parameter estimation accuracy by 65% on synthetic anisotropic perfusion data. On paired rescans of in vivo fetal MRI, our method increases the repeatability of parameter estimation in the placenta by 46%.
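
For background, the bi-exponential IVIM signal model that such estimators fit is standard: $S(b) = S_0\,(f e^{-b D^*} + (1-f) e^{-b D})$. The sketch below generates a noisy synthetic signal and fits it with conventional least squares; the paper's neural posterior estimator is not reproduced, and the b-values and parameter ranges are illustrative:

```python
# The standard bi-exponential IVIM model with a conventional least-squares
# fit, as a baseline-style sketch (not the paper's neural estimator).
import numpy as np
from scipy.optimize import curve_fit

def ivim(b, s0, f, d_star, d):
    """S(b) = S0 * (f*exp(-b*D*) + (1-f)*exp(-b*D))."""
    return s0 * (f * np.exp(-b * d_star) + (1 - f) * np.exp(-b * d))

b_values = np.array([0, 10, 20, 50, 100, 200, 400, 600, 800], float)  # s/mm^2
true = dict(s0=1.0, f=0.15, d_star=0.02, d=0.0015)  # illustrative values
rng = np.random.default_rng(0)
signal = ivim(b_values, **true) + rng.normal(0, 0.02, b_values.size)  # noisy

popt, _ = curve_fit(ivim, b_values, signal,
                    p0=[1.0, 0.1, 0.01, 0.001],
                    bounds=([0, 0, 0.003, 0], [2, 1, 0.1, 0.003]))
print(dict(zip(["s0", "f", "d_star", "d"], popt)))
```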

6.Boosted Convolutional Neural Networks for Motor Imagery EEG Decoding with Multiwavelet-based Time-Frequency Conditional Granger Causality Analysis pdf

Decoding EEG signals of different mental states is a challenging task for brain-computer interfaces (BCIs) due to the nonstationarity of perceptual decision processes. This paper presents a novel boosted convolutional neural network (ConvNet) decoding scheme for motor imagery (MI) EEG signals, assisted by multiwavelet-based time-frequency (TF) causality analysis. Specifically, multiwavelet basis functions are first combined with the Geweke spectral measure to obtain high-resolution TF conditional Granger causality (CGC) representations, where a regularized orthogonal forward regression (ROFR) algorithm is adopted to detect a parsimonious model with good generalization performance. Causality images for the network input, preserving the time, frequency, and location information of connectivity, are then designed from the TF-CGC distributions of alpha-band multichannel EEG signals. We further construct boosted ConvNets using spatio-temporal convolutions, together with advances in deep learning including cropping and boosting methods, to extract discriminative causality features and classify MI tasks. Our proposed approach outperforms the competition-winning algorithm, with a 12.15% increase in average accuracy and a 74.02% decrease in the associated inter-subject standard deviation for the same binary classification on BCI competition-IV dataset-IIa. Experimental results indicate that boosted ConvNets with causality images work well in decoding MI-EEG signals and provide a promising framework for developing MI-BCI systems.

7.Generative adversarial networks and adversarial methods in biomedical image analysis pdf

Generative adversarial networks (GANs) and other adversarial methods are based on a game-theoretical perspective on joint optimization of two neural networks as players in a game. Adversarial techniques have been extensively used to synthesize and analyze biomedical images. We provide an introduction to GANs and adversarial methods, with an overview of biomedical image analysis tasks that have benefited from such methods. We conclude with a discussion of strengths and limitations of adversarial methods in biomedical image analysis, and propose potential future research directions.

8.Differentiable Fine-grained Quantization for Deep Neural Network Compression pdf

Neural networks have shown great performance in cognitive tasks. When deploying network models on mobile devices with limited resources, weight quantization has been widely adopted. Binary quantization obtains the highest compression but usually results in a large accuracy drop. In practice, 8-bit or 16-bit quantization is often used, aiming to maintain the same accuracy as the original 32-bit precision. We observe that different layers have different sensitivity to quantization. Thus, judiciously selecting different precisions for different layers/structures can potentially produce more efficient models than traditional quantization methods by striking a better balance between accuracy and compression rate. In this work, we propose a fine-grained quantization approach for deep neural network compression by relaxing the search space of quantization bitwidths from discrete to continuous. The proposed approach applies gradient-descent-based optimization to generate a mixed-precision quantization scheme that outperforms the accuracy of traditional quantization methods under the same compression rate.
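
A minimal sketch of the relaxation idea, under my own assumptions about its form (the paper's exact scheme may differ): interpolate between the quantizations at the two neighboring integer bitwidths, with a straight-through estimator so gradients reach both the weights and the bitwidth.

```python
# Continuous-bitwidth quantization sketch: the bitwidth b is a learnable
# scalar; soft_bit_quantize mixes the two neighboring integer bitwidths.
import torch

def uniform_quantize(w, bits):
    levels = 2 ** bits - 1
    lo, hi = w.min().detach(), w.max().detach()
    scale = (hi - lo) / levels
    x = (w - lo) / scale
    q = torch.round(x)
    q = (q - x).detach() + x        # straight-through estimator for round()
    return q * scale + lo

def soft_bit_quantize(w, b):
    """b is a continuous scalar tensor, e.g. b = 4.3."""
    b_lo = torch.floor(b).clamp(min=1)
    b_hi = b_lo + 1
    frac = b - b_lo                 # the gradient reaches b through frac
    return (1 - frac) * uniform_quantize(w, int(b_lo)) \
               + frac * uniform_quantize(w, int(b_hi))

w = torch.randn(64, requires_grad=True)
b = torch.tensor(4.3, requires_grad=True)
loss = (soft_bit_quantize(w, b) ** 2).mean()   # stand-in task loss
loss.backward()
print(b.grad)                                  # the bitwidth receives a gradient
```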

9.Machine Learning Methods for Track Classification in the AT-TPC pdf

We evaluate machine learning methods for event classification in the Active-Target Time Projection Chamber detector at the National Superconducting Cyclotron Laboratory (NSCL) at Michigan State University. An automated method to single out the desired reaction product would result in more accurate physics results as well as a faster analysis process. Binary and multi-class classification methods were tested on data produced by the $^{46}$Ar(p,p) experiment run at the NSCL in September 2015. We found a Convolutional Neural Network to be the most successful classifier of proton scattering events for transfer learning. Results from this investigation and recommendations for event classification in future experiments are presented.

10.Dermatologist Level Dermoscopy Skin Cancer Classification Using Different Deep Learning Convolutional Neural Networks Algorithms pdf

In this paper, the effectiveness and capability of convolutional neural networks have been studied in the classification of 8 skin diseases. Different pre-trained state-of-the-art architectures (DenseNet 201, ResNet 152, Inception v3, InceptionResNet v2) were used and applied to 10135 dermoscopy skin images in total (HAM10000: 10015, PH2: 120). The utilized dataset includes 8 diagnostic categories: melanoma, melanocytic nevi, basal cell carcinoma, benign keratosis, actinic keratosis and intraepithelial carcinoma, dermatofibroma, vascular lesions, and atypical nevi. The aim is to compare the ability of deep learning with the performance of highly trained dermatologists. Overall, the mean results show that all deep learning models outperformed dermatologists (by at least 11%). The best ROC AUC values for melanoma and basal cell carcinoma are 94.40% (ResNet 152) and 99.30% (DenseNet 201), versus 82.26% and 88.82% for dermatologists, respectively. Also, DenseNet 201 had the highest macro and micro averaged AUC values for overall classification (98.16% and 98.79%, respectively).

11.Block Matching Frame based Material Reconstruction for Spectral CT pdf

Spectral computed tomography (CT) has great potential in material identification and decomposition. To achieve high-quality material composition images and further suppress x-ray beam hardening artifacts, we first propose a one-step material reconstruction model based on a Taylor first-order expansion. We then develop a basic material reconstruction method named the material simultaneous algebraic reconstruction technique (MSART). Considering the local similarity of each material image, we incorporate a powerful block matching frame (BMF) into the material reconstruction (MR) model and obtain a BMF-based MR (BMFMR) method. Because the BMFMR model contains an L0-norm problem, we adopt a split-Bregman method for optimization. Numerical simulation and physical phantom experiment results validate the correctness of the material reconstruction algorithms and demonstrate that the BMF regularization outperforms total variation and non-local mean regularizations.

12.From Machine to Machine: An OCT-trained Deep Learning Algorithm for Objective Quantification of Glaucomatous Damage in Fundus Photographs pdf

Previous approaches using deep learning algorithms to classify glaucomatous damage on fundus photographs have been limited by the requirement for human labeling of a reference training set. We propose a new approach using spectral-domain optical coherence tomography (SDOCT) data to train a deep learning algorithm to quantify glaucomatous structural damage on optic disc photographs. The dataset included 32,820 pairs of optic disc photos and SDOCT retinal nerve fiber layer (RNFL) scans from 2,312 eyes of 1,198 subjects. A deep learning convolutional neural network was trained to assess optic disc photographs and predict SDOCT average RNFL thickness. The performance of the algorithm was evaluated in an independent test sample. The mean prediction of average RNFL thickness from all 6,292 optic disc photos in the test set was 83.3$\pm$14.5 $\mu$m, whereas the mean average RNFL thickness from all corresponding SDOCT scans was 82.5$\pm$16.8 $\mu$m (P = 0.164). There was a very strong correlation between predicted and observed RNFL thickness values (r = 0.832; P < 0.001), with a mean absolute error of the predictions of 7.39 $\mu$m. The areas under the receiver operating characteristic curves for discriminating glaucoma from healthy eyes with the deep learning predictions and actual SDOCT measurements were 0.944 (95% CI: 0.912-0.966) and 0.940 (95% CI: 0.902-0.966), respectively (P = 0.724). In conclusion, we introduced a novel deep learning approach to assess optic disc photographs and provide quantitative information about the amount of neural damage. This approach could potentially be used to diagnose and stage glaucomatous damage from optic disc photographs.

13.Predicting optical coherence tomography-derived diabetic macular edema grades from fundus photographs using deep learning pdf

Diabetic eye disease is one of the fastest growing causes of preventable blindness. With the advent of anti-VEGF (vascular endothelial growth factor) therapies, it has become increasingly important to detect center-involved diabetic macular edema (DME). However, center-involved DME is diagnosed using optical coherence tomography (OCT), which is not generally available at screening sites because of cost and workflow constraints. Instead, screening programs rely on the detection of hard exudates as a proxy for DME on color fundus photographs, often resulting in high false positive or false negative calls. To improve the accuracy of DME screening, we trained a deep learning model to use color fundus photographs to predict DME grades derived from OCT exams. Our "OCT-DME" model had an AUC of 0.89 (95% CI: 0.87-0.91), which corresponds to a sensitivity of 85% at a specificity of 80%. In comparison, three retinal specialists had similar sensitivities (82-85%), but only half the specificity (45-50%, p < 0.001 for each comparison with the model). The positive predictive value (PPV) of the OCT-DME model was 61% (95% CI: 56-66%), approximately double the 36-38% achieved by the retina specialists. In addition, we used saliency and other techniques to examine how the model makes its predictions. The ability of deep learning algorithms to make clinically relevant predictions that generally require sophisticated 3D-imaging equipment from simple 2D images has broad relevance to many other applications in medical imaging.

14.Visions of a generalized probability theory pdf

In this Book we argue that the fruitful interaction of computer vision and belief calculus is capable of stimulating significant advances in both fields. From a methodological point of view, novel theoretical results concerning the geometric and algebraic properties of belief functions as mathematical objects are illustrated and discussed in Part II, with a focus on both a perspective 'geometric approach' to uncertainty and an algebraic solution to the issue of conflicting evidence. In Part III we show how these theoretical developments arise from important computer vision problems (such as articulated object tracking, data association and object pose estimation) to which, in turn, the evidential formalism is able to provide interesting new solutions. Finally, some initial steps towards a generalization of the notion of total probability to belief functions are taken, in the perspective of endowing the theory of evidence with a complete battery of estimation and inference tools to the benefit of all scientists and practitioners.

15.A Case for Object Compositionality in Deep Generative Models of Images pdf

Deep generative models seek to recover the process with which the observed data was generated. They may be used to synthesize new samples or to subsequently extract representations. Successful approaches in the domain of images are driven by several core inductive biases. However, a bias to account for the compositional way in which humans structure a visual scene in terms of objects has frequently been overlooked. In this work we propose to structure the generator of a GAN to consider objects and their relations explicitly, and generate images by means of composition. This provides a way to efficiently learn a more accurate generative model of real-world images, and serves as an initial step towards learning corresponding object representations. We evaluate our approach on several multi-object image datasets, and find that the generator learns to identify and disentangle information corresponding to different objects at a representational level. A human study reveals that the resulting generative model is better at generating images that are more faithful to the reference distribution.

16.Characterization of Brain Cortical Morphology Using Localized Topology-Encoding Graphs pdf

The human brain cortical layer has a convoluted morphology that is unique to each individual. Characterization of the cortical morphology is necessary in longitudinal studies of structural brain change, as well as in discriminating individuals in health and disease. A method for encoding the cortical morphology in the form of a graph is presented. The design of graphs that encode the global cerebral hemisphere cortices as well as localized cortical regions is proposed. Spectral features of these graphs are then studied and proposed as descriptors of cortical morphology. As proof-of-concept of their applicability in characterizing cortical morphology, the descriptors are studied in the context of discriminating individuals based on their sex.
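
As an illustrative sketch of such descriptors (the graph construction and the number of eigenvalues are assumptions, not the paper's exact encoding), the low end of the normalized-Laplacian spectrum of a vertex-adjacency graph can serve as a compact morphology signature:

```python
# Spectral graph descriptor sketch: eigenvalues of the normalized Laplacian
# of a graph built over mesh vertices.
import numpy as np
import scipy.sparse as sp
from scipy.sparse.csgraph import laplacian

def spectral_descriptor(edges, n_vertices, k=16):
    """edges: (E, 2) int array of graph edges; returns the k smallest
    eigenvalues of the normalized graph Laplacian."""
    i, j = edges[:, 0], edges[:, 1]
    a = sp.coo_matrix((np.ones(len(edges)), (i, j)),
                      shape=(n_vertices, n_vertices))
    a = ((a + a.T) > 0).astype(float)          # undirected adjacency
    lap = laplacian(a.tocsr(), normed=True)
    vals = np.linalg.eigvalsh(lap.toarray())   # dense is fine for small graphs
    return vals[:k]

# Toy stand-in for a cortical mesh: a ring of 200 vertices
n = 200
edges = np.array([(v, (v + 1) % n) for v in range(n)])
print(spectral_descriptor(edges, n, k=8).round(4))
```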

17.Strategies for Training Stain Invariant CNNs pdf

An important part of Digital Pathology is the analysis of multiple digitised whole slide images from differently stained tissue sections. It is common practice to mount consecutive sections containing corresponding microscopic structures on glass slides, and to stain them differently to highlight specific tissue components. These multiple staining modalities result in very different images but include a significant amount of consistent image information. Deep learning approaches have recently been proposed to analyse these images in order to automatically identify objects of interest for pathologists. These supervised approaches require a vast amount of annotations, which are difficult and expensive to acquire, a problem that is multiplied with multiple stainings. This article presents several training strategies that make progress towards stain invariant networks. By training the network on one commonly used staining modality and applying it to images that include corresponding but differently stained tissue structures, the presented unsupervised strategies demonstrate significant improvements over standard training strategies.

18.Projecting Trouble: Light Based Adversarial Attacks on Deep Learning Classifiers pdf

This work demonstrates a physical attack on a deep learning image classification system using projected light onto a physical scene. Prior work is dominated by techniques for creating adversarial examples which directly manipulate the digital input of the classifier. Such an attack is limited to scenarios where the adversary can directly update the inputs to the classifier. This could happen by intercepting and modifying the inputs to an online API such as Clarifai or Cloud Vision. Such limitations have led to a vein of research around physical attacks where objects are constructed to be inherently adversarial or adversarial modifications are added to cause misclassification. Our work differs from other physical attacks in that we can cause misclassification dynamically without altering physical objects in a permanent way.
We construct an experimental setup which includes a light projection source, an object for classification, and a camera to capture the scene. Experiments are conducted against 2D and 3D objects from CIFAR-10. Initial tests show projected light patterns selected via differential evolution could degrade classification from 98% to 22% and 89% to 43% probability for 2D and 3D targets respectively. Subsequent experiments explore sensitivity to physical setup and compare two additional baseline conditions for all 10 CIFAR classes. Some physical targets are more susceptible to perturbation. Simple attacks show near equivalent success, and 6 of the 10 classes were disrupted by light.
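
The search itself is easy to sketch in simulation with SciPy's differential evolution; the additive RGB tint standing in for physical projection, the bounds, and the toy model below are all assumptions rather than the authors' physical setup:

```python
# Simulation-only sketch of a light-based attack search: find a tint that
# minimizes the classifier's probability for the true class.
import numpy as np
from scipy.optimize import differential_evolution

def project_tint(image, tint):
    """Crude stand-in for projected light: add an RGB offset, clip to [0, 1]."""
    return np.clip(image + np.asarray(tint).reshape(1, 1, 3), 0.0, 1.0)

def attack(model, image, true_class):
    def objective(tint):
        probs = model(project_tint(image, tint))  # shape (n_classes,)
        return probs[true_class]                  # minimize true-class prob
    bounds = [(-0.5, 0.5)] * 3                    # RGB tint strength
    result = differential_evolution(objective, bounds, maxiter=50, seed=0)
    return result.x, result.fun

# Dummy "model" so the demo runs: prefers bright images for class 0
def model(img):
    p0 = img.mean()
    return np.array([p0, 1 - p0])

image = np.full((8, 8, 3), 0.8)
tint, p = attack(model, image, true_class=0)
print("best tint:", tint.round(3), "true-class prob:", round(p, 3))
```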

19.Downsampling leads to Image Memorization in Convolutional Autoencoders pdf

Memorization of data in deep neural networks has become a subject of significant research interest. In this paper, we link memorization of images in deep convolutional autoencoders to downsampling through strided convolution. To analyze this mechanism in a simpler setting, we train linear convolutional autoencoders and show that linear combinations of training data are stored as eigenvectors in the linear operator corresponding to the network when downsampling is used. On the other hand, networks without downsampling do not memorize training data. We provide further evidence that the same effect happens in nonlinear networks. Moreover, downsampling in nonlinear networks causes the model to not only memorize linear combinations of images, but individual training images. Since convolutional autoencoder components are building blocks of deep convolutional networks, we envision that our findings will shed light on the important phenomenon of memorization in over-parameterized deep networks.
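
The linear-network analysis lends itself to a compact experiment. The sketch below (image size, architecture, and training schedule are illustrative, not the paper's) trains a strictly linear convolutional autoencoder with strided downsampling, materializes the induced linear operator column by column, and checks how strongly its top eigenvector aligns with the training images:

```python
# Linear convolutional autoencoder: since there are no nonlinearities, the
# whole network is one linear map A, which we can extract and eigendecompose.
import torch
import torch.nn as nn

d = 16  # 16x16 single-channel images
net = nn.Sequential(
    nn.Conv2d(1, 4, 3, stride=2, padding=1, bias=False),          # downsample
    nn.ConvTranspose2d(4, 1, 4, stride=2, padding=1, bias=False), # upsample
)
x = torch.randn(2, 1, d, d)              # two "training images"
opt = torch.optim.Adam(net.parameters(), lr=1e-2)
for _ in range(2000):
    opt.zero_grad()
    loss = ((net(x) - x) ** 2).mean()
    loss.backward()
    opt.step()

# Materialize A column by column: column i of A is net(e_i).
with torch.no_grad():
    eye = torch.eye(d * d).reshape(d * d, 1, d, d)
    A = net(eye).reshape(d * d, d * d).T
    evals, evecs = torch.linalg.eig(A)
    top = evecs[:, evals.abs().argmax()].real
    for img in x:
        v = img.flatten()
        cos = (top @ v) / (top.norm() * v.norm())
        print(f"|cos(top eigenvector, training image)| = {cos.abs().item():.3f}")
```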

20.Bottleneck Supervised U-Net for Pixel-wise Liver and Tumor Segmentation pdf

Convolutional neural networks (CNNs) have been widely used for image processing tasks. In this paper, we design a bottleneck supervised U-Net model and apply it to liver and tumor segmentation. Taking an image as input, the model outputs segmented images of the same size, each pixel of which takes a value from 1 to K, where K is the number of classes to be segmented. The innovations of this paper are two-fold: first, we design a novel U-Net structure that includes dense and inception blocks as the base U-Net; second, we design a double U-Net architecture based on the base U-Net, consisting of an encoding U-Net and a segmentation U-Net. The encoding U-Net is first trained to encode the labels; the encodings are then used to supervise the bottleneck of the segmentation U-Net. While training the segmentation U-Net, a weighted average of Dice loss (for the final output) and MSE loss (for the bottleneck) is used as the overall loss function. This approach helps retain the hidden features of the input images. The model is applied to a liver tumor 3D CT scan dataset to conduct liver and tumor segmentation sequentially. Experimental results indicate that the bottleneck supervised U-Net accomplishes segmentation tasks effectively, with better performance in controlling shape distortion and reducing false positives and false negatives, besides accelerating convergence. The model also generalizes well, leaving room for further improvement.
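
The composite objective is simple to write down. A sketch follows, with the 0.8/0.2 weighting and all tensor shapes as illustrative assumptions (the abstract specifies a weighted average but not the weights):

```python
# Bottleneck-supervised loss sketch: soft Dice on the segmentation output
# plus MSE between the segmentation U-Net's bottleneck and the label
# encoding produced by the pretrained encoding U-Net.
import torch

def soft_dice_loss(pred, target, eps=1e-6):
    """pred: (B, K, H, W) probabilities; target: one-hot of the same shape."""
    inter = (pred * target).sum(dim=(2, 3))
    union = pred.sum(dim=(2, 3)) + target.sum(dim=(2, 3))
    return 1 - ((2 * inter + eps) / (union + eps)).mean()

def bottleneck_supervised_loss(pred, target, bottleneck, label_encoding,
                               w_dice=0.8, w_mse=0.2):
    dice = soft_dice_loss(pred, target)
    mse = torch.mean((bottleneck - label_encoding.detach()) ** 2)
    return w_dice * dice + w_mse * mse

# Shapes for illustration: 2 images, 3 classes, 64x64 masks, 256-d bottleneck
pred = torch.softmax(torch.randn(2, 3, 64, 64), dim=1)
target = torch.nn.functional.one_hot(torch.randint(0, 3, (2, 64, 64)), 3)
target = target.permute(0, 3, 1, 2).float()
print(bottleneck_supervised_loss(pred, target,
                                 torch.randn(2, 256), torch.randn(2, 256)))
```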

21.Hyper-Process Model: A Zero-Shot Learning algorithm for Regression Problems based on Shape Analysis pdf

Zero-shot learning (ZSL) can be defined as correctly solving a task for which no training data are available, based on knowledge previously acquired from different but related tasks. So far, this area has mostly drawn attention from the computer vision community, where a new unseen image needs to be correctly classified, assuming the target class was not used in the training procedure. Apart from image classification, only a couple of generic methods have been proposed that are applicable to both classification and regression. These learn the relations among model coefficients so that new ones can be predicted according to provided conditions. To the best of our knowledge, no methods exist that are tailored to regression and take advantage of that setting. Therefore, the present work proposes a novel algorithm for regression problems that uses data drawn from trained models instead of model coefficients. A shape analysis of the data is performed to create a statistical shape model, and new shapes are generated to train new models. The proposed algorithm is tested in a theoretical setting using the beta distribution, where the main problem is to estimate a function that predicts curves based on already learned, different but related ones.
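
Reading the beta-distribution setting literally, a minimal stand-in for the pipeline might look as follows; the curve parameters, grid, and two-component PCA are illustrative assumptions:

```python
# Statistical shape model sketch: sample curves from "source task" models
# (here, beta densities), build a shape space with PCA, generate new curves.
import numpy as np
from scipy.stats import beta
from sklearn.decomposition import PCA

x = np.linspace(0.01, 0.99, 100)
# Curves drawn from already-"trained" related tasks
params = [(2, 5), (2, 4), (3, 5), (3, 4), (4, 6), (2, 6)]
curves = np.array([beta.pdf(x, a, b) for a, b in params])

pca = PCA(n_components=2)
coeffs = pca.fit_transform(curves)          # shape coordinates per task

# Generate a new plausible curve by sampling in the shape space
rng = np.random.default_rng(0)
new_coeff = coeffs.mean(0) + rng.normal(0, coeffs.std(0))
new_curve = pca.inverse_transform(new_coeff[None])[0]
print(new_curve.shape, new_curve.min().round(3), new_curve.max().round(3))
```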

22.Vehicle classification using ResNets, localisation and spatially-weighted pooling pdf

We investigate whether ResNet architectures can outperform more traditional Convolutional Neural Networks on the task of fine-grained vehicle classification. We train and test ResNet-18, ResNet-34 and ResNet-50 on the Comprehensive Cars dataset without pre-training on other datasets. We then modify the networks to use Spatially Weighted Pooling. Finally, we add a localisation step before the classification process, using a network based on ResNet-50. We find that using Spatially Weighted Pooling and localisation both improve the classification accuracy of ResNet-50. Spatially Weighted Pooling increases accuracy by 1.5 percentage points and localisation increases accuracy by 3.4 percentage points. Using both increases accuracy by 3.7 percentage points, giving a top-1 accuracy of 96.351% on the Comprehensive Cars dataset. Our method achieves higher accuracy than a range of methods including those that use traditional CNNs. However, it does not perform quite as well as pre-trained networks that use Spatially Weighted Pooling.

23.Instance Segmentation and Object Detection with Bounding Shape Masks pdf

Many recent object detection algorithms use a bounding box regressor to predict the position coordinates of an object (i.e., to predict the four continuous variables of an object's bounding box). To improve object detection accuracy, we propose four types of object boundary segmentation masks that provide position information in a different manner than object detection algorithms do. Additionally, we investigate the effect of the proposed object bounding shape masks on instance segmentation. To evaluate the proposed masks, our method adds a proposed bounding shape (or box) mask to extend the Faster R-CNN framework; we call this Bounding Shape (or Box) Mask R-CNN. We experimentally verified its performance with two benchmark datasets, MS COCO and Cityscapes. The results indicate that our proposed models generally outperform Faster R-CNN and Mask R-CNN.

24.Coherence Constraints in Facial Expression Recognition pdf

Recognizing facial expressions from static images or video sequences is a widely studied but still challenging problem. The recent progress obtained by deep neural architectures, or by ensembles of heterogeneous models, has shown that integrating multiple input representations leads to state-of-the-art results. In particular, the appearance and the shape of the input face, or the representations of some face parts, are commonly used to boost the quality of the recognizer. This paper investigates the application of Convolutional Neural Networks (CNNs) with the aim of building a versatile recognizer of expressions in static images that can be further applied to video sequences. We first study the importance of different face parts in the recognition task, focussing on appearance and shape-related features. Then we cast the learning problem in the Semi-Supervised setting, exploiting video data where only a few frames are supervised. The unsupervised portion of the training data is used to enforce three types of coherence: temporal coherence, coherence among the predictions on the face parts, and coherence between the appearance and shape-based representations. Our experimental analysis shows that coherence constraints can improve the quality of the expression recognizer, thus offering a suitable basis to profitably exploit unsupervised video sequences. Finally, we present some examples with occlusions where the shape-based predictor performs better than the appearance-based one.

25.Multi-Stage Reinforcement Learning For Object Detection pdf

We present a reinforcement learning approach for detecting objects within an image. Our approach performs a step-wise deformation of a bounding box with the goal of tightly framing the object. It uses a hierarchical tree-like representation of predefined region candidates, which the agent can zoom in on. This reduces the number of region candidates that must be evaluated so that the agent can afford to compute new feature maps before each step to enhance detection quality. We compare an approach that is based purely on zoom actions with one that is extended by a second refinement stage to fine-tune the bounding box after each zoom step. We also improve the fitting ability by allowing for different aspect ratios of the bounding box. Finally, we propose different reward functions to lead to a better guidance of the agent while following its search trajectories. Experiments indicate that each of these extensions leads to more correct detections. The best performing approach comprises a zoom stage and a refinement stage, uses aspect-ratio modifying actions and is trained using a combination of three different reward metrics.

26.Multi-scale Geometric Summaries for Similarity-based Sensor Fusion pdf

In this work, we address the fusion of heterogeneous sensor data using wavelet-based summaries of fused self-similarity information from each sensor. The technique we develop is quite general, does not require domain-specific knowledge or physical models, and requires no training. Nonetheless, it can perform surprisingly well at the general task of differentiating classes of time-ordered behavior sequences which are sensed by more than one modality. As a demonstration of our capabilities in the audio-to-video context, we focus on the differentiation of speech sequences.
Data from two or more modalities are first represented using self-similarity matrices (SSMs) corresponding to time-ordered point clouds in the feature spaces of each of these data sources; we note that these feature spaces can be of entirely different scale and dimensionality.
A fused similarity template is then derived from the modality-specific SSMs using a technique called similarity network fusion (SNF). We investigate pipelines using SNF as both an upstream (feature-level) and a downstream (ranking-level) fusion technique. Multiscale geometric features of this template are then extracted using a recently-developed technique called the scattering transform, and these features are then used to differentiate speech sequences. This method outperforms unsupervised techniques which operate directly on the raw data, and it also outperforms stovepiped methods which operate on SSMs separately derived from the distinct modalities. The benefits of this method become even more apparent as the simulated peak signal to noise ratio decreases.
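
The first stage, one self-similarity matrix per modality, is easy to sketch. In the toy below (the Gaussian similarity kernel and the synthetic features are assumptions), each modality yields its own (T, T) matrix, which is what makes feature spaces of different scale and dimensionality comparable; the SNF and scattering-transform stages are not reproduced:

```python
# Self-similarity matrix (SSM) sketch: one (T, T) similarity matrix per
# modality, regardless of that modality's feature dimension.
import numpy as np
from scipy.spatial.distance import pdist, squareform

def self_similarity_matrix(features, sigma=1.0):
    """features: (T, d) array, one feature vector per time step."""
    dists = squareform(pdist(features, metric="euclidean"))
    return np.exp(-(dists ** 2) / (2 * sigma ** 2))

# Two modalities with different dimensionality but shared temporal structure
t = np.linspace(0, 4 * np.pi, 200)
audio_feats = np.c_[np.sin(t), np.cos(t)]                    # (200, 2)
video_feats = np.c_[np.sin(t), np.cos(t), np.sin(2 * t), t]  # (200, 4)
print(self_similarity_matrix(audio_feats).shape,
      self_similarity_matrix(video_feats).shape)
```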

27.Incremental Deep Learning for Robust Object Detection in Unknown Cluttered Environments pdf

Object detection in streaming images is a major step in different detection-based applications, such as object tracking, action recognition, robot navigation, and visual surveillance. In most cases, image quality is noisy and biased, and as a result, the data distributions are disturbed and imbalanced. Most object detection approaches, such as the faster region-based convolutional neural network (Faster RCNN), the Single Shot Multibox Detector with 300x300 inputs (SSD300), and You Only Look Once version 2 (YOLOv2), rely on simple sampling without considering distortions and noise under real-world changing environments, despite poor object labeling. In this paper, we propose an incremental active semi-supervised learning (IASSL) method for unseen object detection. It combines batch-based active learning (AL) and bin-based semi-supervised learning (SSL) to leverage the strong points of AL's exploration and SSL's exploitation capabilities. A collaborative sampling method is also adopted to measure the uncertainty and diversity of AL and the confidence in SSL. Batch-based AL allows us to select more informative, confident, and representative samples at low cost. Bin-based SSL divides streaming image samples into several bins, and each bin repeatedly transfers the discriminative knowledge of convolutional neural network (CNN) deep learning to the next bin until the performance criterion is reached. IASSL can overcome noisy and biased labels in unknown, cluttered data distributions. We obtain superior performance compared to state-of-the-art technologies such as Faster RCNN, SSD300, and YOLOv2.

28.Dental pathology detection in 3D cone-beam CT pdf

Cone-beam computed tomography (CBCT) is a valuable imaging method in dental diagnostics that provides information not available in traditional 2D imaging. However, interpretation of CBCT images is a time-consuming process that requires a physician to work with complicated software. In this work we propose an automated pipeline composed of several deep convolutional neural networks and algorithmic heuristics. Our task is two-fold: a) find locations of each present tooth inside a 3D image volume, and b) detect several common tooth conditions in each tooth. The proposed system achieves 96.3% accuracy in tooth localization and an average of 0.94 AUROC for 6 common tooth conditions.

29.Coarse-to-fine volumetric segmentation of teeth in Cone-Beam CT pdf

We consider the problem of localizing and segmenting individual teeth inside 3D Cone-Beam Computed Tomography (CBCT) images. To handle large image sizes we approach this task with a coarse-to-fine framework, where the whole volume is first analyzed as a 33-class semantic segmentation (adults have up to 32 teeth) in coarse resolution, followed by binary semantic segmentation of the cropped region of interest in original resolution. To improve the performance of the challenging 33-class segmentation, we first train the Coarse step model on a large weakly labeled dataset, then fine-tune it on a smaller precisely labeled dataset. The Fine step model is trained with precise labels only. Experiments using our in-house dataset show significant improvement for both weakly-supervised pretraining and for the addition of the Fine step. Empirically, this framework yields precise teeth masks with low localization errors sufficient for many real-world applications.

30.Mask Propagation Network for Video Object Segmentation pdf

In this work, we propose a mask propagation network that treats the video segmentation problem as guided instance segmentation. Similar to most MaskTrack-based video segmentation methods, our method takes the mask probability map of the previous frame and the appearance of the current frame as inputs, and predicts the mask probability map for the current frame. Specifically, we adopt the Xception-backbone-based DeepLab v3+ model as the probability map predictor in our prediction pipeline. Moreover, instead of the full image and the original mask probability map, our network takes as inputs the region of interest of the instance and a new mask probability map, which is warped by the optical flow between the previous and current frames. We also ensemble the modified One-Shot Video Segmentation Network to make the final predictions in order to retrieve and segment missing instances.

31.Learning color space adaptation from synthetic to real images of cirrus clouds pdf

Training on synthetic data is becoming popular in vision due to the convenient acquisition of accurate pixel-level labels. But the domain gap between synthetic and real images significantly degrades the performance of the trained model. We propose a color space adaptation method to bridge the gap. A set of closed-form operations are adopted to make color space adjustments while preserving the labels. We embed these operations into a two-stage learning approach, and demonstrate the adaptation efficacy on the semantic segmentation task of cirrus clouds.

32.Cross-Resolution Person Re-identification with Deep Antithetical Learning pdf

Images with different resolutions are ubiquitous in public person re-identification (ReID) datasets and real-world scenes; it is thus crucial for a person ReID model to handle image resolution variations to improve its generalization ability. However, most existing person ReID methods pay little attention to this resolution discrepancy problem. One paradigm is to use complicated methods to map all images into an artificial image space, which, however, disrupts the natural image distribution and requires heavy image preprocessing. In this paper, we analyze the deficiencies of several widely-used objective functions for handling image resolution discrepancies and propose a new framework called deep antithetical learning that learns directly from the natural image space rather than creating an arbitrary one. We first quantify and categorize original training images according to their resolutions. Then we create an antithetical training set and make sure that original training images have counterparts with antithetical resolutions in this new set. Finally, a novel Contrastive Center Loss (CCL) is proposed to learn from images with different resolutions without being interfered with by their resolution discrepancies. Extensive experimental analyses and evaluations indicate that the proposed framework, even using a vanilla deep ReID network, exhibits remarkable performance improvements. Without bells and whistles, our approach outperforms previous state-of-the-art methods by a large margin.

33.DSFD: Dual Shot Face Detector pdf

Recently, Convolutional Neural Networks (CNNs) have achieved great success in face detection. However, face detection remains challenging owing to the high degree of variability in scale, pose, occlusion, expression, appearance and illumination. In this paper, we propose a novel face detection network named Dual Shot Face Detector (DSFD), which inherits the architecture of SSD and introduces a Feature Enhance Module (FEM) for transferring the original feature maps to extend the single-shot detector to a dual-shot detector. Specifically, a Progressive Anchor Loss (PAL), computed using two sets of anchors, is adopted to effectively facilitate the features. Additionally, we propose an Improved Anchor Matching (IAM) method, integrating novel data augmentation techniques and an anchor design strategy into DSFD to provide better initialization for the regressor. Extensive experiments on popular benchmarks, WIDER FACE (easy: $0.966$, medium: $0.957$, hard: $0.904$) and FDDB (discontinuous: $0.991$, continuous: $0.862$), demonstrate the superiority of DSFD over the state-of-the-art face detectors (e.g., PyramidBox and SRN). Code will be made available upon publication.

34.Automated Evaluation of Semantic Segmentation Robustness for Autonomous Driving pdf

One of the fundamental challenges in the design of perception systems for autonomous vehicles is validating the performance of each algorithm under a comprehensive variety of operating conditions. In the case of vision-based semantic segmentation, there are known issues when encountering new scenarios that are sufficiently different from the training data. In addition, even small variations in environmental conditions such as illumination and precipitation can affect the classification performance of the segmentation model. Given the reliance on visual information, these effects often translate into poor semantic pixel classification, which can potentially lead to catastrophic consequences when driving autonomously. This paper presents a novel method for analysing the robustness of semantic segmentation models and provides a number of metrics to evaluate the classification performance over a variety of environmental conditions. The process incorporates an additional sensor (lidar) to automate the process, eliminating the need for labour-intensive hand labelling of validation data. System integrity can be monitored as the performance of the vision sensors is validated against a different sensor modality, which is necessary for detecting failures that are inherent to vision technology. Experimental results are presented based on multiple datasets collected at different times of the year under different environmental conditions. These results show that semantic segmentation performance varies depending on the weather, camera parameters, existence of shadows, etc. The results also demonstrate how the metrics can be used to compare and validate performance after making improvements to a model, and to compare the performance of different networks.

35.Fault Area Detection in Leaf Diseases using k-means Clustering pdf

With an increasing population, the food crisis is growing day by day. In this time of crisis, leaf disease in crops is one of the biggest problems in the food industry. In this paper, we address that problem and propose an efficient method to detect leaf disease. Leaf diseases can be detected from sample images of the leaf with the help of image processing and segmentation. Using k-means clustering and Otsu's method, the faulty region in a leaf is detected, which helps determine the proper course of action to be taken. Furthermore, if the ratio of normal to faulty regions is calculated, it would be possible to predict whether the leaf can be cured at all.
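
A plausible rendering of that pipeline in OpenCV follows; the Lab color space, K = 3, and the rule that the cluster with the highest mean a-channel is the faulty one are my assumptions, not the paper's exact recipe:

```python
# k-means + Otsu sketch for leaf fault-area estimation.
import cv2
import numpy as np

def fault_ratio(bgr_image, k=3):
    lab = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2LAB)
    samples = lab.reshape(-1, 3).astype(np.float32)
    criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 20, 1.0)
    _, labels, centers = cv2.kmeans(samples, k, None, criteria, 5,
                                    cv2.KMEANS_RANDOM_CENTERS)
    # In Lab, the 'a' channel grows with redness/brownness: treat the cluster
    # with the highest mean 'a' as the faulty region (an assumption).
    faulty = int(np.argmax(centers[:, 1]))
    mask = (labels.reshape(lab.shape[:2]) == faulty).astype(np.uint8) * 255
    # Refine the cluster mask with Otsu's threshold on the grayscale image
    gray = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2GRAY)
    _, otsu = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    refined = cv2.bitwise_and(mask, otsu)
    return refined.mean() / 255.0   # fraction of pixels flagged as faulty

img = cv2.imread("leaf.jpg")        # hypothetical input image
if img is not None:
    print(f"faulty-area ratio: {fault_ratio(img):.2%}")
```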

36.Resolving Referring Expressions in Images With Labeled Elements pdf

Images may have elements containing text and a bounding box associated with them, for example, text identified via optical character recognition on a computer screen image, or a natural image with labeled objects. We present an end-to-end trainable architecture to incorporate the information from these elements and the image to segment/identify the part of the image a natural language expression is referring to. We calculate an embedding for each element and then project it onto the corresponding location (i.e., the associated bounding box) of the image feature map. We show that this architecture gives an improvement in resolving referring expressions, over only using the image, and other methods that incorporate the element information. We demonstrate experimental results on the referring expression datasets based on COCO, and on a webpage image referring expression dataset that we developed.

37.Background Subtraction using Compressed Low-resolution Images pdf

Image processing and recognition are an important part of modern society, with applications in fields such as advanced artificial intelligence, smart assistants, and security surveillance. The essential first step in almost all visual tasks is background subtraction with a static camera. Ensuring that this critical step is performed as efficiently as possible would therefore improve all aspects of object recognition and tracking, behavior comprehension, etc. Although background subtraction has been applied for many years, meeting real-time requirements remains a challenge. In this letter, we present a novel approach to background subtraction. The proposed method uses compressed, low-resolution grayscale images for the background subtraction. These low-resolution grayscale images were found to preserve the salient information very well. To verify the feasibility of our methodology, two prevalent methods, ViBe and GMM, are used in the experiment. The results confirm the effectiveness of our approach.

38.AUNet: Breast Mass Segmentation of Whole Mammograms pdf

Deep learning based segmentation has seen rapid development lately in both natural and medical image processing. However, its application to mammographic mass segmentation is still a challenging task due to the low signal-to-noise ratio and the wide variety of mass shapes and sizes. In this study, we propose a new network, AUNet, for breast mass segmentation. Unlike most methods, which need to extract mass-centered image patches, AUNet can directly process whole mammograms. Furthermore, it introduces an asymmetrical structure to the traditional encoder-decoder segmentation architecture and proposes a new upsampling block, the Attention Up (AU) block. In particular, the AU block is designed to have three merits. First, it compensates for the information loss of bilinear upsampling by dense upsampling. Second, it offers a more effective method to fuse high- and low-level features. Third, it includes a channel-attention function to highlight rich-information channels. We evaluated the proposed method on two publicly available datasets, CBIS-DDSM and INbreast. Compared to three existing fully convolutional networks, AUNet achieved the best performance, with an average Dice similarity coefficient of 81.8% for CBIS-DDSM and 79.1% for INbreast.
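
The three stated merits can be combined in one sketch of an upsampling block; the concrete layers below (pixel shuffle for dense upsampling, concatenation for feature fusion, squeeze-and-excitation for channel attention) are assumptions rather than AUNet's published design:

```python
# Attention-up-style block sketch: dense + bilinear upsampling, skip fusion,
# and channel attention.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionUpBlock(nn.Module):
    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        self.dense_up = nn.Sequential(      # learned "dense" upsampling
            nn.Conv2d(in_ch, out_ch * 4, 3, padding=1), nn.PixelShuffle(2))
        self.fuse = nn.Conv2d(out_ch + in_ch + skip_ch, out_ch, 3, padding=1)
        self.se = nn.Sequential(            # channel attention
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(out_ch, out_ch // 4, 1), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch // 4, out_ch, 1), nn.Sigmoid())

    def forward(self, x, skip):
        bilinear = F.interpolate(x, scale_factor=2, mode="bilinear",
                                 align_corners=False)
        dense = self.dense_up(x)
        y = self.fuse(torch.cat([dense, bilinear, skip], dim=1))
        return y * self.se(y)               # reweight informative channels

block = AttentionUpBlock(in_ch=64, skip_ch=32, out_ch=32)
out = block(torch.randn(1, 64, 16, 16), torch.randn(1, 32, 32, 32))
print(out.shape)  # torch.Size([1, 32, 32, 32])
```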

39.A Deep-Learning-Based Fashion Attributes Detection Model pdf

Analyzing fashion attributes is essential in the fashion design process. Current fashion forecasting firms, such as WGSN, utilize information from around the world (from fashion shows, visual merchandising, blogs, etc.). They gather information by experience, by observation, by media scans, by interviews, and by exposure to new things. This information-analysis process is called abstracting: recognizing similarities or differences across garments and collections. Such abstraction ability is useful in many fashion careers with different purposes. Fashion forecasters abstract across design collections and across time to identify fashion changes and directions; designers, product developers and buyers abstract across a group of garments and collections to develop cohesive and visually appealing lines; sales and marketing executives abstract across a product line each season to recognize selling points; fashion journalists and bloggers abstract across runway photos to recognize symbolic core concepts that can be translated into editorial features. Fashion attribute analysis for such fashion insiders requires much more detailed and in-depth attribute annotation than that for consumers, and requires inference across multiple domains. In this project, we propose a data-driven approach for recognizing fashion attributes. Specifically, a modified version of the Faster R-CNN model is trained on images from a large-scale localization dataset with 594 fine-grained attributes under different scenarios, for example in online stores and street snapshots. This model is then used to detect garment items and classify clothing attributes for runway photos and fashion illustrations.

40.A Binary Optimization Approach for Constrained K-Means Clustering pdf

K-Means clustering still plays an important role in many computer vision problems. While the conventional Lloyd method, which alternates between centroid update and cluster assignment, is primarily used in practice, it may converge to a solution with empty clusters. Furthermore, some applications may require the clusters to satisfy a specific set of constraints, e.g., cluster sizes, must-link/cannot-link. Several methods have been introduced to solve constrained K-Means clustering. Due to the non-convex nature of K-Means, however, existing approaches may result in sub-optimal solutions that poorly approximate the true clusters. In this work, we provide a new perspective to tackle this problem. Particularly, we reconsider constrained K-Means as a Binary Optimization Problem and propose a novel optimization scheme to search for feasible solutions in the binary domain. This approach allows us to solve constrained K-Means where multiple types of constraints can be simultaneously enforced. Experimental results on synthetic and real datasets show that our method provides better clustering accuracy with faster runtime compared to several commonly used techniques.
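
For the special case of equal cluster sizes, the binary-assignment view can be made concrete with an exact solver: replicate each of the K clusters into n/K slots and solve the point-to-slot assignment with the Hungarian algorithm, alternating with centroid updates. This is a simple stand-in for the idea, not the paper's optimization scheme:

```python
# Size-constrained k-means sketch via exact binary assignment per iteration.
import numpy as np
from scipy.optimize import linear_sum_assignment

def size_constrained_kmeans(x, k, iters=20, seed=0):
    n = len(x)
    assert n % k == 0, "demo assumes equal cluster sizes"
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(n, k, replace=False)]
    slots_per = n // k
    for _ in range(iters):
        # cost of assigning point i to each slot of each cluster
        cost = ((x[:, None, :] - centers[None]) ** 2).sum(-1)  # (n, k)
        cost = np.repeat(cost, slots_per, axis=1)              # (n, n)
        rows, cols = linear_sum_assignment(cost)               # binary X
        labels = cols[np.argsort(rows)] // slots_per
        centers = np.array([x[labels == j].mean(0) for j in range(k)])
    return labels, centers

x = np.r_[np.random.default_rng(1).normal(0, 1, (30, 2)),
          np.random.default_rng(2).normal(5, 1, (30, 2))]
labels, _ = size_constrained_kmeans(x, k=2)
print(np.bincount(labels))  # [30 30]: the size constraint holds exactly
```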

41.End-to-End Diagnosis and Segmentation Learning from Cardiac Magnetic Resonance Imaging pdf

Cardiac magnetic resonance (CMR) is used extensively in the diagnosis and management of cardiovascular disease. Deep learning methods have proven to deliver segmentation results comparable to human experts in CMR imaging, but there have been no convincing results for the problem of end-to-end segmentation and diagnosis from CMR. This is in part due to a lack of sufficiently large datasets required to train robust diagnosis models. In this paper, we propose a learning method to train diagnosis models, where our approach is designed to work with relatively small datasets. In particular, the optimisation loss is based on multi-task learning that jointly trains for the tasks of segmentation and diagnosis classification. We hypothesize that segmentation has a regularizing effect on the learning of features relevant for diagnosis. Using the 100 training and 50 testing samples available from the Automated Cardiac Diagnosis Challenge (ACDC) dataset, which has a balanced distribution of 5 cardiac diagnoses, we observe a reduction of the classification error from 32% to 22%, and faster convergence compared to a baseline without segmentation. To the best of our knowledge, these are the best diagnosis results from CMR using an end-to-end diagnosis and segmentation learning method.

42.Resource-Constrained Simultaneous Detection and Labeling of Objects in High-Resolution Satellite Images pdf

We describe a strategy for the detection and classification of man-made objects in large high-resolution satellite photos under computational resource constraints. We detect and classify candidate objects using five pipelines of convolutional neural network (CNN) processing, run in parallel. Each pipeline has its own unique strategy for fine-tuning parameters, proposal region filtering, and dealing with image scales. Conflicting region proposals are merged based on region confidence rather than just overlap areas, which improves the quality of the final bounding-box regions selected. We demonstrate this strategy using the recent xView challenge, a complex benchmark with more than 1,100 high-resolution images spanning 800,000 aerial objects around the world, covering a total area of 1,400 square kilometers at 0.3 meter ground sample distance. To tackle the resource-constrained problem posed by the xView challenge, where inference is restricted to a CPU with an 8GB memory limit, we used lightweight CNNs trained with the single shot detector algorithm. Our approach was competitive on sequestered sets; it was ranked third.

43.Structured Domain Randomization: Bridging the Reality Gap by Context-Aware Synthetic Data pdf

We present structured domain randomization (SDR), a variant of domain randomization (DR) that takes into account the structure and context of the scene. In contrast to DR, which places objects and distractors randomly according to a uniform probability distribution, SDR places objects and distractors randomly according to probability distributions that arise from the specific problem at hand. In this manner, SDR-generated imagery enables the neural network to take the context around an object into consideration during detection. We demonstrate the power of SDR for the problem of 2D bounding box car detection, achieving competitive results on real data after training only on synthetic data. On the KITTI easy, moderate, and hard tasks, we show that SDR outperforms other approaches to generating synthetic data (VKITTI, Sim 200k, or DR), as well as real data collected in a different domain (BDD100K). Moreover, synthetic SDR data combined with real KITTI data outperforms real KITTI data alone.

44.NestDNN: Resource-Aware Multi-Tenant On-Device Deep Learning for Continuous Mobile Vision pdf

Mobile vision systems such as smartphones, drones, and augmented-reality headsets are revolutionizing our lives. These systems usually run multiple applications concurrently and their available resources at runtime are dynamic due to events such as starting new applications, closing existing applications, and application priority changes. In this paper, we present NestDNN, a framework that takes the dynamics of runtime resources into account to enable resource-aware multi-tenant on-device deep learning for mobile vision systems. NestDNN enables each deep learning model to offer flexible resource-accuracy trade-offs. At runtime, it dynamically selects the optimal resource-accuracy trade-off for each deep learning model to fit the model's resource demand to the system's available runtime resources. In doing so, NestDNN efficiently utilizes the limited resources in mobile vision systems to jointly maximize the performance of all the concurrently running applications. Our experiments show that compared to the resource-agnostic status quo approach, NestDNN achieves as much as 4.2% increase in inference accuracy, 2.0x increase in video frame processing rate and 1.7x reduction on energy consumption.

45.A Fusion Approach for Multi-Frame Optical Flow Estimation pdf

To date, top-performing optical flow estimation methods only take pairs of consecutive frames into account. While elegant and appealing, the idea of using more than two frames has not yet produced state-of-the-art results. We present a simple, yet effective fusion approach for multi-frame optical flow that benefits from longer-term temporal cues. Our method first warps the optical flow from previous frames to the current, thereby yielding multiple plausible estimates. It then fuses the complementary information carried by these estimates into a new optical flow field. At the time of submission, our method ranks first among published flow methods in the MPI Sintel and KITTI 2015 benchmarks.
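
Only the fusion step is sketched below, and the per-pixel brightness-constancy selection rule is my assumption (the paper learns the fusion): warp the next frame back by each candidate flow and keep, per pixel, the candidate with the lower photometric error:

```python
# Multi-candidate flow fusion sketch via per-pixel photometric error.
import numpy as np
from scipy.ndimage import map_coordinates

def warp(image, flow):
    """Backward-warp image by flow: out(p) = image(p + flow(p))."""
    h, w = image.shape
    yy, xx = np.mgrid[0:h, 0:w].astype(float)
    coords = np.array([yy + flow[..., 1], xx + flow[..., 0]])
    return map_coordinates(image, coords, order=1, mode="nearest")

def fuse_flows(frame_t, frame_t1, candidates):
    """Pick, per pixel, the candidate flow with the lower photometric error."""
    errors = [np.abs(warp(frame_t1, f) - frame_t) for f in candidates]
    choice = np.argmin(errors, axis=0)                   # (H, W)
    stacked = np.stack(candidates)                       # (C, H, W, 2)
    h, w = choice.shape
    return stacked[choice, np.arange(h)[:, None], np.arange(w)[None, :]]

h = w = 32
frame_t = np.random.rand(h, w)
true_flow = np.zeros((h, w, 2)); true_flow[..., 0] = 1.0  # +1 px in x
frame_t1 = warp(frame_t, -true_flow)                      # move scene by flow
fused = fuse_flows(frame_t, frame_t1, [true_flow, np.zeros((h, w, 2))])
print("mean fused x-flow:", round(fused[..., 0].mean(), 2))  # ~1.0
```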

46.DeepLSR: Deep learning approach for laser speckle reduction pdf

We present a deep learning approach for laser speckle reduction ('DeepLSR') on images illuminated with a multi-wavelength, red-green-blue laser. We acquired a set of images from a variety of objects illuminated with laser light, both with and without optical speckle reduction, and an incoherent light-emitting diode. An adversarial network was then trained for paired image-to-image translation to transform images from a source domain of coherent illumination to a target domain of incoherent illumination. When applied to a new image set of coherently-illuminated test objects, this network reconstructs incoherently-illuminated images with an average peak signal-to-noise ratio and structural similarity index of 36 dB and 0.91, respectively, compared to 30 dB and 0.88 using optical speckle reduction, and 30 dB and 0.88 using non-local means processing. We demonstrate proof-of-concept for speckle-reduced laser endoscopy by applying DeepLSR to images of ex-vivo gastrointestinal tissue illuminated with a fiber-coupled laser source. For applications that require speckle-reduced imaging, DeepLSR holds promise to enable the use of coherent sources that are more stable, efficient, compact, and brighter than conventional alternatives.