ArXiv cs.CV --Fri, 4 Sep 2020

1.Flow-edge Guided Video Completion ⬇️

We present a new flow-based video completion algorithm. Previous flow completion methods are often unable to retain the sharpness of motion boundaries. Our method first extracts and completes motion edges, and then uses them to guide piecewise-smooth flow completion with sharp edges. Existing methods propagate colors among local flow connections between adjacent frames. However, not all missing regions in a video can be reached in this way because the motion boundaries form impenetrable barriers. Our method alleviates this problem by introducing non-local flow connections to temporally distant frames, enabling propagating video content over motion boundaries. We validate our approach on the DAVIS dataset. Both visual and quantitative results show that our method compares favorably against the state-of-the-art algorithms.

2.Computational Analysis of Deformable Manifolds: from Geometric Modelling to Deep Learning ⬇️

Leo Tolstoy opened his monumental novel Anna Karenina with the now famous words: Happy families are all alike; every unhappy family is unhappy in its own way A similar notion also applies to mathematical spaces: Every flat space is alike; every unflat space is unflat in its own way. However, rather than being a source of unhappiness, we will show that the diversity of non-flat spaces provides a rich area of study. The genesis of the so-called big data era and the proliferation of social and scientific databases of increasing size has led to a need for algorithms that can efficiently process, analyze and, even generate high dimensional data. However, the curse of dimensionality leads to the fact that many classical approaches do not scale well with respect to the size of these problems. One technique to avoid some of these ill-effects is to exploit the geometric structure of coherent data. In this thesis, we will explore geometric methods for shape processing and data analysis. More specifically, we will study techniques for representing manifolds and signals supported on them through a variety of mathematical tools including, but not limited to, computational differential geometry, variational PDE modeling, and deep learning. First, we will explore non-isometric shape matching through variational modeling. Next, we will use ideas from parallel transport on manifolds to generalize convolution and convolutional neural networks to deformable manifolds. Finally, we conclude by proposing a novel auto-regressive model for capturing the intrinsic geometry and topology of data. Throughout this work, we will use the idea of computing correspondences as a though-line to both motivate our work and analyze our results.

3.Synthetic-to-Real Unsupervised Domain Adaptation for Scene Text Detection in the Wild ⬇️

Deep learning-based scene text detection can achieve preferable performance, powered with sufficient labeled training data. However, manual labeling is time consuming and laborious. At the extreme, the corresponding annotated data are unavailable. Exploiting synthetic data is a very promising solution except for domain distribution mismatches between synthetic datasets and real datasets. To address the severe domain distribution mismatch, we propose a synthetic-to-real domain adaptation method for scene text detection, which transfers knowledge from synthetic data (source domain) to real data (target domain). In this paper, a text self-training (TST) method and adversarial text instance alignment (ATA) for domain adaptive scene text detection are introduced. ATA helps the network learn domain-invariant features by training a domain classifier in an adversarial manner. TST diminishes the adverse effects of false positives~(FPs) and false negatives~(FNs) from inaccurate pseudo-labels. Two components have positive effects on improving the performance of scene text detectors when adapting from synthetic-to-real scenes. We evaluate the proposed method by transferring from SynthText, VISD to ICDAR2015, ICDAR2013. The results demonstrate the effectiveness of the proposed method with up to 10% improvement, which has important exploration significance for domain adaptive scene text detection. Code is available at this https URL

4.MIPGAN -- Generating Robust and High QualityMorph Attacks Using Identity Prior Driven GAN ⬇️

Face morphing attacks target to circumvent Face Recognition Systems (FRS) by employing face images derived from multiple data subjects (e.g., accomplices and malicious actors). Morphed images can verify against contributing data subjects with a reasonable success rate, given they have a high degree of identity resemblance. The success of the morphing attacks is directly dependent on the quality of the generated morph images. We present a new approach for generating robust attacks extending our earlier framework for generating face morphs. We present a new approach using an Identity Prior Driven Generative Adversarial Network, which we refer to as \textit{MIPGAN (Morphing through Identity Prior driven GAN)}. The proposed MIPGAN is derived from the StyleGAN with a newly formulated loss function exploiting perceptual quality and identity factor to generate a high quality morphed face image with minimal artifacts and with higher resolution. We demonstrate the proposed approach's applicability to generate robust morph attacks by evaluating it against a commercial Face Recognition System (FRS) and demonstrate the success rate of attacks. Extensive experiments are carried out to assess the FRS's vulnerability against the proposed morphed face generation technique on three types of data such as digital images, re-digitized (printed and scanned) images, and compressed images after re-digitization from newly generated \textit{MIPGAN Face Morph Dataset}. The obtained results demonstrate that the proposed approach of morph generation profoundly threatens the FRS.

5.Multi-Loss Weighting with Coefficient of Variations ⬇️

Many interesting tasks in machine learning and computer vision are learned by optimising an objective function defined as a weighted linear combination of multiple losses. The final performance is sensitive to choosing the correct (relative) weights for these losses. Finding a good set of weights is often done by adopting them into the set of hyper-parameters, which are set using an extensive grid search. This is computationally expensive. In this paper, the weights are defined based on properties observed while training the model, including the specific batch loss, the average loss, and the variance for each of the losses. An additional advantage is that the defined weights evolve during training, instead of using static loss weights. In literature, loss weighting is mostly used in a multi-task learning setting, where the different tasks obtain different weights. However, there is a plethora of single-task multi-loss problems that can benefit from automatic loss weighting. In this paper, it is shown that these multi-task approaches do not work on single tasks. Instead, a method is proposed that automatically and dynamically tunes loss weights throughout training specifically for single-task multi-loss problems. The method incorporates a measure of uncertainty to balance the losses. The validity of the approach is shown empirically for different tasks on multiple datasets.

6.Future Frame Prediction of a Video Sequence ⬇️

Predicting future frames of a video sequence has been a problem of high interest in the field of Computer Vision as it caters to a multitude of applications. The ability to predict, anticipate and reason about future events is the essence of intelligence and one of the main goals of decision-making systems such as human-machine interaction, robot navigation and autonomous driving. However, the challenge lies in the ambiguous nature of the problem as there may be multiple future sequences possible for the same input video shot. A naively designed model averages multiple possible futures into a single blurry prediction.
Recently, two distinct approaches have attempted to address this problem as: (a) use of latent variable models that represent underlying stochasticity and (b) adversarially trained models that aim to produce sharper images. A latent variable model often struggles to produce realistic results, while an adversarially trained model underutilizes latent variables and thus fails to produce diverse predictions. These methods have revealed complementary strengths and weaknesses. Combining the two approaches produces predictions that appear more realistic and better cover the range of plausible futures. This forms the basis and objective of study in this project work.
In this paper, we proposed a novel multi-scale architecture combining both approaches. We validate our proposed model through a series of experiments and empirical evaluations on Moving MNIST, UCF101, and Penn Action datasets. Our method outperforms the results obtained using the baseline methods.

7.Multi-domain semantic segmentation with pyramidal fusion ⬇️

We present our submission to the semantic segmentation contest of the Robust Vision Challenge held at ECCV 2020. The contest requires submitting the same model to seven benchmarks from three different domains. Our approach is based on the SwiftNet architecture with pyramidal fusion. We address inconsistent taxonomies with a single-level 193-dimensional softmax output. We strive to train with large batches in order to stabilize optimization of a hard recognition problem, and to favour smooth evolution of batchnorm statistics. We achieve this by implementing a custom backward step through log-sum-prob loss, and by using small crops before freezing the population statistics. Our model ranks first on the RVC semantic segmentation challenge as well as on the WildDash 2 leaderboard. This suggests that pyramidal fusion is competitive not only for efficient inference with lightweight backbones, but also in large-scale setups for multi-domain application.

8.Modification method for single-stage object detectors that allows to exploit the temporal behaviour of a scene to improve detection accuracy ⬇️

A simple modification method for single-stage generic object detection neural networks, such as YOLO and SSD, is proposed, which allows for improving the detection accuracy on video data by exploiting the temporal behavior of the scene in the detection pipeline. It is shown that, using this method, the detection accuracy of the base network can be considerably improved, especially for occluded and hidden objects. It is shown that a modified network is more prone to detect hidden objects with more confidence than an unmodified one. A weakly supervised training method is proposed, which allows for training a modified network without requiring any additional annotated data.

9.Few-shot Object Detection with Feature Attention Highlight Module in Remote Sensing Images ⬇️

In recent years, there are many applications of object detection in remote sensing field, which demands a great number of labeled data. However, in many cases, data is extremely rare. In this paper, we proposed a few-shot object detector which is designed for detecting novel objects based on only a few examples. Through fully leveraging labeled base classes, our model that is composed of a feature-extractor, a feature attention highlight module as well as a two-stage detection backend can quickly adapt to novel classes. The pre-trained feature extractor whose parameters are shared produces general features. While the feature attention highlight module is designed to be light-weighted and simple in order to fit the few-shot cases. Although it is simple, the information provided by it in a serial way is helpful to make the general features to be specific for few-shot objects. Then the object-specific features are delivered to the two-stage detection backend for the detection results. The experiments demonstrate the effectiveness of the proposed method for few-shot cases.

10.SCG-Net: Self-Constructing Graph Neural Networks for Semantic Segmentation ⬇️

Capturing global contextual representations by exploiting long-range pixel-pixel dependencies has shown to improve semantic segmentation performance. However, how to do this efficiently is an open question as current approaches of utilising attention schemes or very deep models to increase the models field of view, result in complex models with large memory consumption. Inspired by recent work on graph neural networks, we propose the Self-Constructing Graph (SCG) module that learns a long-range dependency graph directly from the image and uses it to propagate contextual information efficiently to improve semantic segmentation. The module is optimised via a novel adaptive diagonal enhancement method and a variational lower bound that consists of a customized graph reconstruction term and a Kullback-Leibler divergence regularization term. When incorporated into a neural network (SCG-Net), semantic segmentation is performed in an end-to-end manner and competitive performance (mean F1-scores of 92.0% and 89.8% respectively) on the publicly available ISPRS Potsdam and Vaihingen datasets is achieved, with much fewer parameters, and at a lower computational cost compared to related pure convolutional neural network (CNN) based models.

11.Layer-specific Optimization for Mixed Data Flow with Mixed Precision in FPGA Design for CNN-based Object Detectors ⬇️

Convolutional neural networks (CNNs) require both intensive computation and frequent memory access, which lead to a low processing speed and large power dissipation. Although the characteristics of the different layers in a CNN are frequently quite different, previous hardware designs have employed common optimization schemes for them. This paper proposes a layer-specific design that employs different organizations that are optimized for the different layers. The proposed design employs two layer-specific optimizations: layer-specific mixed data flow and layer-specific mixed precision. The mixed data flow aims to minimize the off-chip access while demanding a minimal on-chip memory (BRAM) resource of an FPGA device. The mixed precision quantization is to achieve both a lossless accuracy and an aggressive model compression, thereby further reducing the off-chip access. A Bayesian optimization approach is used to select the best sparsity for each layer, achieving the best trade-off between the accuracy and compression. This mixing scheme allows the entire network model to be stored in BRAMs of the FPGA to aggressively reduce the off-chip access, and thereby achieves a significant performance enhancement. The model size is reduced by 22.66-28.93 times compared to that in a full-precision network with a negligible degradation of accuracy on VOC, COCO, and ImageNet datasets. Furthermore, the combination of mixed dataflow and mixed precision significantly outperforms the previous works in terms of both throughput, off-chip access, and on-chip memory requirement.

12.DESC: Domain Adaptation for Depth Estimation via Semantic Consistency ⬇️

Accurate real depth annotations are difficult to acquire, needing the use of special devices such as a LiDAR sensor. Self-supervised methods try to overcome this problem by processing video or stereo sequences, which may not always be available. Instead, in this paper, we propose a domain adaptation approach to train a monocular depth estimation model using a fully-annotated source dataset and a non-annotated target dataset. We bridge the domain gap by leveraging semantic predictions and low-level edge features to provide guidance for the target domain. We enforce consistency between the main model and a second model trained with semantic segmentation and edge maps, and introduce priors in the form of instance heights. Our approach is evaluated on standard domain adaptation benchmarks for monocular depth estimation and show consistent improvement upon the state-of-the-art.

13.Auto-Classifier: A Robust Defect Detector Based on an AutoML Head ⬇️

The dominant approach for surface defect detection is the use of hand-crafted feature-based methods. However, this falls short when conditions vary that affect extracted images. So, in this paper, we sought to determine how well several state-of-the-art Convolutional Neural Networks perform in the task of surface defect detection. Moreover, we propose two methods: CNN-Fusion, that fuses the prediction of all the networks into a final one, and Auto-Classifier, which is a novel proposal that improves a Convolutional Neural Network by modifying its classification component using AutoML. We carried out experiments to evaluate the proposed methods in the task of surface defect detection using different datasets from DAGM2007. We show that the use of Convolutional Neural Networks achieves better results than traditional methods, and also, that Auto-Classifier out-performs all other methods, by achieving 100% accuracy and 100% AUC results throughout all the datasets.

14.1st Place Solution of LVIS Challenge 2020: A Good Box is not a Guarantee of a Good Mask ⬇️

This article introduces the solutions of the team lvisTraveler for LVIS Challenge 2020. In this work, two characteristics of LVIS dataset are mainly considered: the long-tailed distribution and high quality instance segmentation mask. We adopt a two-stage training pipeline. In the first stage, we incorporate EQL and self-training to learn generalized representation. In the second stage, we utilize Balanced GroupSoftmax to promote the classifier, and propose a novel proposal assignment strategy and a new balanced mask loss for mask head to get more precise mask predictions. Finally, we achieve 41.5 and 41.2 AP on LVIS v1.0 val and test-dev splits respectively, outperforming the baseline based on X101-FPN-MaskRCNN by a large margin.

15.Physics-based Shading Reconstruction for Intrinsic Image Decomposition ⬇️

We investigate the use of photometric invariance and deep learning to compute intrinsic images (albedo and shading). We propose albedo and shading gradient descriptors which are derived from physics-based models. Using the descriptors, albedo transitions are masked out and an initial sparse shading map is calculated directly from the corresponding RGB image gradients in a learning-free unsupervised manner. Then, an optimization method is proposed to reconstruct the full dense shading map. Finally, we integrate the generated shading map into a novel deep learning framework to refine it and also to predict corresponding albedo image to achieve intrinsic image decomposition. By doing so, we are the first to directly address the texture and intensity ambiguity problems of the shading estimations. Large scale experiments show that our approach steered by physics-based invariant descriptors achieve superior results on MIT Intrinsics, NIR-RGB Intrinsics, Multi-Illuminant Intrinsic Images, Spectral Intrinsic Images, As Realistic As Possible, and competitive results on Intrinsic Images in the Wild datasets while achieving state-of-the-art shading estimations.

16.A Comparison of Pre-trained Vision-and-Language Models for Multimodal Representation Learning across Medical Images and Reports ⬇️

Joint image-text embedding extracted from medical images and associated contextual reports is the bedrock for most biomedical vision-and-language (V+L) tasks, including medical visual question answering, clinical image-text retrieval, clinical report auto-generation. In this study, we adopt four pre-trained V+L models: LXMERT, VisualBERT, UNIER and PixelBERT to learn multimodal representation from MIMIC-CXR radiographs and associated reports. The extrinsic evaluation on OpenI dataset shows that in comparison to the pioneering CNN-RNN model, the joint embedding learned by pre-trained V+L models demonstrate performance improvement in the thoracic findings classification task. We conduct an ablation study to analyze the contribution of certain model components and validate the advantage of joint embedding over text-only embedding. We also visualize attention maps to illustrate the attention mechanism of V+L models.

17.TRACE: Transform Aggregate and Compose Visiolinguistic Representations for Image Search with Text Feedback ⬇️

The ability to efficiently search for images over an indexed database is the cornerstone for several user experiences. Incorporating user feedback, through multi-modal inputs provide flexible and interaction to serve fine-grained specificity in requirements. We specifically focus on text feedback, through descriptive natural language queries. Given a reference image and textual user feedback, our goal is to retrieve images that satisfy constraints specified by both of these input modalities. The task is challenging as it requires understanding the textual semantics from the text feedback and then applying these changes to the visual representation. To address these challenges, we propose a novel architecture TRACE which contains a hierarchical feature aggregation module to learn the composite visio-linguistic representations. TRACE achieves the SOTA performance on 3 benchmark datasets: FashionIQ, Shoes, and Birds-to-Words, with an average improvement of at least ~5.7%, ~3%, and ~5% respectively in R@K metric. Our extensive experiments and ablation studies show that TRACE consistently outperforms the existing techniques by significant margins both quantitatively and qualitatively.

18.Modeling Global Body Configurations in American Sign Language ⬇️

American Sign Language (ASL) is the fourth most commonly used language in the United States and is the language most commonly used by Deaf people in the United States and the English-speaking regions of Canada. Unfortunately, until recently, ASL received little research. This is due, in part, to its delayed recognition as a language until William C. Stokoe's publication in 1960. Limited data has been a long-standing obstacle to ASL research and computational modeling. The lack of large-scale datasets has prohibited many modern machine-learning techniques, such as Neural Machine Translation, from being applied to ASL. In addition, the modality required to capture sign language (i.e. video) is complex in natural settings (as one must deal with background noise, motion blur, and the curse of dimensionality). Finally, when compared with spoken languages, such as English, there has been limited research conducted into the linguistics of ASL.
We realize a simplified version of Liddell and Johnson's Movement-Hold (MH) Model using a Probabilistic Graphical Model (PGM). We trained our model on ASLing, a dataset collected from three fluent ASL signers. We evaluate our PGM against other models to determine its ability to model ASL. Finally, we interpret various aspects of the PGM and draw conclusions about ASL phonetics. The main contributions of this paper are

19.Adherent Mist and Raindrop Removal from a Single Image Using Attentive Convolutional Network ⬇️

Temperature difference-induced mist adhered to the windshield, camera lens, etc. are often inhomogeneous and obscure, which can easily obstruct the vision and degrade the image severely. Together with adherent raindrops, they bring considerable challenges to various vision systems but without enough attention. Recent methods for similar problems typically use hand-crafted priors to generate spatial attention maps. In this work, we propose to visually remove the adherent mist and raindrop jointly from a single image using attentive convolutional neural networks. We apply classification activation map attention to our model to strengthen the spatial attention without hand-crafted priors. In addition, the smoothed dilated convolution is adopted to obtain a large receptive field without spatial information loss, and the dual attention module is utilized for efficiently selecting channels and spatial features. Our experiments show our method achieves state-of-the-art performance, and demonstrate that this underrated practical problem is critical to high-level vision scenes.

20.Ref-NMS: Breaking Proposal Bottlenecks in Two-Stage Referring Expression Grounding ⬇️

The prevailing framework for solving referring expression grounding is based on a two-stage process: 1) detecting proposals with an object detector and 2) grounding the referent to one of the proposals. Existing two-stage solutions mostly focus on the grounding step, which aims to align the expressions with the proposals. In this paper, we argue that these methods overlook an obvious mismatch between the roles of proposals in the two stages: they generate proposals solely based on the detection confidence (i.e., expression-agnostic), hoping that the proposals contain all right instances in the expression (i.e., expression-aware). Due to this mismatch, current two-stage methods suffer from a severe performance drop between detected and ground-truth proposals. To this end, we propose Ref-NMS, which is the first method to yield expression-aware proposals at the first stage. Ref-NMS regards all nouns in the expression as critical objects, and introduces a lightweight module to predict a score for aligning each box with a critical object. These scores can guide the NMSoperation to filter out the boxes irrelevant to the expression, increasing the recall of critical objects, resulting in a significantly improved grounding performance. Since Ref-NMS is agnostic to the grounding step, it can be easily integrated into any state-of-the-art two-stage method. Extensive ablation studies on several backbones, benchmarks, and tasks consistently demonstrate the superiority of Ref-NMS.

21.Tasks Integrated Networks: Joint Detection and Retrieval for Image Search ⬇️

The traditional object retrieval task aims to learn a discriminative feature representation with intra-similarity and inter-dissimilarity, which supposes that the objects in an image are manually or automatically pre-cropped exactly. However, in many real-world searching scenarios (e.g., video surveillance), the objects (e.g., persons, vehicles, etc.) are seldom accurately detected or annotated. Therefore, object-level retrieval becomes intractable without bounding-box annotation, which leads to a new but challenging topic, i.e. image-level search. In this paper, to address the image search issue, we first introduce an end-to-end Integrated Net (I-Net), which has three merits: 1) A Siamese architecture and an on-line pairing strategy for similar and dissimilar objects in the given images are designed. 2) A novel on-line pairing (OLP) loss is introduced with a dynamic feature dictionary, which alleviates the multi-task training stagnation problem, by automatically generating a number of negative pairs to restrict the positives. 3) A hard example priority (HEP) based softmax loss is proposed to improve the robustness of classification task by selecting hard categories. With the philosophy of divide and conquer, we further propose an improved I-Net, called DC-I-Net, which makes two new contributions: 1) two modules are tailored to handle different tasks separately in the integrated framework, such that the task specification is guaranteed. 2) A class-center guided HEP loss (C2HEP) by exploiting the stored class centers is proposed, such that the intra-similarity and inter-dissimilarity can be captured for ultimate retrieval. Extensive experiments on famous image-level search oriented benchmark datasets demonstrate that the proposed DC-I-Net outperforms the state-of-the-art tasks-integrated and tasks-separated image search models.

22.Spatial Transformer Point Convolution ⬇️

Point clouds are unstructured and unordered in the embedded 3D space. In order to produce consistent responses under different permutation layouts, most existing methods aggregate local spatial points through maximum or summation operation. But such an aggregation essentially belongs to the isotropic filtering on all operated points therein, which tends to lose the information of geometric structures. In this paper, we propose a spatial transformer point convolution (STPC) method to achieve anisotropic convolution filtering on point clouds. To capture and represent implicit geometric structures, we specifically introduce spatial direction dictionary to learn those latent geometric components. To better encode unordered neighbor points, we design sparse deformer to transform them into the canonical ordered dictionary space by using direction dictionary learning. In the transformed space, the standard image-like convolution can be leveraged to generate anisotropic filtering, which is more robust to express those finer variances of local regions. Dictionary learning and encoding processes are encapsulated into a network module and jointly learnt in an end-to-end manner. Extensive experiments on several public datasets (including S3DIS, Semantic3D, SemanticKITTI) demonstrate the effectiveness of our proposed method in point clouds semantic segmentation task.

23.Noise-Aware Texture-Preserving Low-Light Enhancement ⬇️

A simple and effective low-light image enhancement method based on a noise-aware texture-preserving retinex model is proposed in this work. The new method, called NATLE, attempts to strike a balance between noise removal and natural texture preservation through a low-complexity solution. Its cost function includes an estimated piece-wise smooth illumination map and a noise-free texture-preserving reflectance map. Afterwards, illumination is adjusted to form the enhanced image together with the reflectance map. Extensive experiments are conducted on common low-light image enhancement datasets to demonstrate the superior performance of NATLE.

24.Towards Practical Implementations of Person Re-Identification from Full Video Frames ⬇️

With the major adoption of automation for cities security, person re-identification (Re-ID) has been extensively studied recently. In this paper, we argue that the current way of studying person re-identification, i.e. by trying to re-identify a person within already detected and pre-cropped images of people, is not sufficient to implement practical security applications, where the inputs to the system are the full frames of the video streams. To support this claim, we introduce the Full Frame Person Re-ID setting (FF-PRID) and define specific metrics to evaluate FF-PRID implementations. To improve robustness, we also formalize the hybrid human-machine collaboration framework, which is inherent to any Re-ID security applications. To demonstrate the importance of considering the FF-PRID setting, we build an experiment showing that combining a good people detection network with a good Re-ID model does not necessarily produce good results for the final application. This underlines a failure of the current formulation in assessing the quality of a Re-ID model and justifies the use of different metrics. We hope that this work will motivate the research community to consider the full problem in order to develop algorithms that are better suited to real-world scenarios.

25.NITES: A Non-Parametric Interpretable Texture Synthesis Method ⬇️

A non-parametric interpretable texture synthesis method, called the NITES method, is proposed in this work. Although automatic synthesis of visually pleasant texture can be achieved by deep neural networks nowadays, the associated generation models are mathematically intractable and their training demands higher computational cost. NITES offers a new texture synthesis solution to address these shortcomings. NITES is mathematically transparent and efficient in training and inference. The input is a single exemplary texture image. The NITES method crops out patches from the input and analyzes the statistical properties of these texture patches to obtain their joint spatial-spectral representations. Then, the probabilistic distributions of samples in the joint spatial-spectral spaces are characterized. Finally, numerous texture images that are visually similar to the exemplary texture image can be generated automatically. Experimental results are provided to show the superior quality of generated texture images and efficiency of the proposed NITES method in terms of both training and inference time.

26.Robust Object Classification Approach using Spherical Harmonics ⬇️

In this paper, we present a robust spherical harmonics approach for the classification of point cloud-based objects. Spherical harmonics have been used for classification over the years, with several frameworks existing in the literature. These approaches use variety of spherical harmonics based descriptors to classify objects. We first investigated these frameworks robustness against data augmentation, such as outliers and noise, as it has not been studied before. Then we propose a spherical convolution neural network framework for robust object classification. The proposed framework uses the voxel grid of concentric spheres to learn features over the unit ball. Our proposed model learn features that are less sensitive to data augmentation due to the selected sampling strategy and the designed convolution operation. We tested our proposed model against several types of data augmentation, such as noise and outliers. Our results show that the proposed model outperforms the state of art networks in terms of robustness to data augmentation.

27.Unsupervised Point Cloud Registration via Salient Points Analysis (SPA) ⬇️

An unsupervised point cloud registration method, called salient points analysis (SPA), is proposed in this work. The proposed SPA method can register two point clouds effectively using only a small subset of salient points. It first applies the PointHop++ method to point clouds, finds corresponding salient points in two point clouds based on the local surface characteristics of points and performs registration by matching the corresponding salient points. The SPA method offers several advantages over the recent deep learning based solutions for registration. Deep learning methods such as PointNetLK and DCP train end-to-end networks and rely on full supervision (namely, ground truth transformation matrix and class label). In contrast, the SPA is completely unsupervised. Furthermore, SPA's training time and model size are much less. The effectiveness of the SPA method is demonstrated by experiments on seen and unseen classes and noisy point clouds from the ModelNet-40 dataset.

28.Unsupervised Feedforward Feature (UFF) Learning for Point Cloud Classification and Segmentation ⬇️

In contrast to supervised backpropagation-based feature learning in deep neural networks (DNNs), an unsupervised feedforward feature (UFF) learning scheme for joint classification and segmentation of 3D point clouds is proposed in this work. The UFF method exploits statistical correlations of points in a point cloud set to learn shape and point features in a one-pass feedforward manner through a cascaded encoder-decoder architecture. It learns global shape features through the encoder and local point features through the concatenated encoder-decoder architecture. The extracted features of an input point cloud are fed to classifiers for shape classification and part segmentation. Experiments are conducted to evaluate the performance of the UFF method. For shape classification, the UFF is superior to existing unsupervised methods and on par with state-of-the-art DNNs. For part segmentation, the UFF outperforms semi-supervised methods and performs slightly worse than DNNs.

29.Efficiency in Real-time Webcam Gaze Tracking ⬇️

Efficiency and ease of use are essential for practical applications of camera based eye/gaze-tracking. Gaze tracking involves estimating where a person is looking on a screen based on face images from a computer-facing camera. In this paper we investigate two complementary forms of efficiency in gaze tracking: 1. The computational efficiency of the system which is dominated by the inference speed of a CNN predicting gaze-vectors; 2. The usability efficiency which is determined by the tediousness of the mandatory calibration of the gaze-vector to a computer screen. To do so, we evaluate the computational speed/accuracy trade-off for the CNN and the calibration effort/accuracy trade-off for screen calibration. For the CNN, we evaluate the full face, two-eyes, and single eye input. For screen calibration, we measure the number of calibration points needed and evaluate three types of calibration: 1. pure geometry, 2. pure machine learning, and 3. hybrid geometric regression. Results suggest that a single eye input and geometric regression calibration achieve the best trade-off.

30.CNN-Based Ultrasound Image Reconstruction for Ultrafast Displacement Tracking ⬇️

Thanks to its capability of acquiring full-view frames at multiple kilohertz, ultrafast ultrasound imaging unlocked the analysis of rapidly changing physical phenomena in the human body, with pioneering applications such as ultrasensitive flow imaging in the cardiovascular system or shear-wave elastography. The accuracy achievable with these motion estimation techniques is strongly contingent upon two contradictory requirements: a high quality of consecutive frames and a high frame rate. Indeed, the image quality can usually be improved by increasing the number of steered ultrafast acquisitions, but at the expense of a reduced frame rate and possible motion artifacts. To achieve accurate motion estimation at uncompromised frame rates and immune to motion artifacts, the proposed approach relies on single ultrafast acquisitions to reconstruct high-quality frames and on only two consecutive frames to obtain 2-D displacement estimates. To this end, we deployed a convolutional neural network-based image reconstruction method combined with a speckle tracking algorithm based on cross-correlation. Numerical and in vivo experiments, conducted in the context of plane-wave imaging, demonstrate that the proposed approach is capable of estimating displacements in regions where the presence of side lobe and grating lobe artifacts prevents any displacement estimation with a state-of-the-art technique that rely on conventional delay-and-sum beamforming. The proposed approach may therefore unlock the full potential of ultrafast ultrasound, in applications such as ultrasensitive cardiovascular motion and flow analysis or shear-wave elastography.

31.Limited View Tomographic Reconstruction Using a Deep Recurrent Framework with Residual Dense Spatial-Channel Attention Network and Sinogram Consistency ⬇️

Limited view tomographic reconstruction aims to reconstruct a tomographic image from a limited number of sinogram or projection views arising from sparse view or limited angle acquisitions that reduce radiation dose or shorten scanning time. However, such a reconstruction suffers from high noise and severe artifacts due to the incompleteness of sinogram. To derive quality reconstruction, previous state-of-the-art methods use UNet-like neural architectures to directly predict the full view reconstruction from limited view data; but these methods leave the deep network architecture issue largely intact and cannot guarantee the consistency between the sinogram of the reconstructed image and the acquired sinogram, leading to a non-ideal reconstruction. In this work, we propose a novel recurrent reconstruction framework that stacks the same block multiple times. The recurrent block consists of a custom-designed residual dense spatial-channel attention network. Further, we develop a sinogram consistency layer interleaved in our recurrent framework in order to ensure that the sampled sinogram is consistent with the sinogram of the intermediate outputs of the recurrent blocks. We evaluate our methods on two datasets. Our experimental results on AAPM Low Dose CT Grand Challenge datasets demonstrate that our algorithm achieves a consistent and significant improvement over the existing state-of-the-art neural methods on both limited angle reconstruction (over 5dB better in terms of PSNR) and sparse view reconstruction (about 4dB better in term of PSNR). In addition, our experimental results on Deep Lesion datasets demonstrate that our method is able to generate high-quality reconstruction for 8 major lesion types.

32.Software Effort Estimation using parameter tuned Models ⬇️

Software estimation is one of the most important activities in the software project. The software effort estimation is required in the early stages of software life cycle. Project Failure is the major problem undergoing nowadays as seen by software project managers. The imprecision of the estimation is the reason for this problem. Assize of software size grows, it also makes a system complex, thus difficult to accurately predict the cost of software development process. The greatest pitfall of the software industry was the fast-changing nature of software development which has made it difficult to develop parametric models that yield high accuracy for software development in all domains. We need the development of useful models that accurately predict the cost of developing a software product. This study presents the novel analysis of various regression models with hyperparameter tuning to get the effective model. Nine different regression techniques are considered for model development

33.Heightmap Reconstruction of Macula on Color Fundus Images Using Conditional Generative Adversarial Networks ⬇️

For medical diagnosis based on retinal images, a clear understanding of 3D structure is often required but due to the 2D nature of images captured, we cannot infer that information. However, by utilizing 3D reconstruction methods, we can construct the 3D structure of the macula area on fundus images which can be helpful for diagnosis and screening of macular disorders. Recent approaches have used shading information for 3D reconstruction or heightmap prediction but their output was not accurate since they ignored the dependency between nearby pixels. Additionally, other methods were dependent on the availability of more than one image of the eye which is not available in practice. In this paper, we use conditional generative adversarial networks (cGANs) to generate images that contain height information of the macula area on a fundus image. Results using our dataset show a 0.6077 improvement in Structural Similarity Index (SSIM) and 0.071 improvements in Mean Squared Error (MSE) metric over Shape from Shading (SFS) method. Additionally, Qualitative studies also indicate that our method outperforms recent approaches.

34.Multimodal brain tumor classification ⬇️

Cancer is a complex disease that provides various types of information depending on the scale of observation. While most tumor diagnostics are performed by observing histopathological slides, radiology images should yield additional knowledge towards the efficacy of cancer diagnostics. This work investigates a deep learning method combining whole slide images and magnetic resonance images to classify tumors. Experiments are prospectively conducted on the 2020 Computational Precision Medicine challenge, in a 3-classes unbalanced classification task. We report cross-validation (resp. validation) balanced-accuracy, kappa and f1 of 0.913, 0.897 and 0.951 (resp. 0.91, 0.90 and 0.94). The complete code of the method is open-source at XXXX. Those include histopathological data pre-processing, and can therefore be used off-the-shelf for other histopathological and/or radiological classification.

35.Detection-Aware Trajectory Generation for a Drone Cinematographer ⬇️

This work investigates an efficient trajectory generation for chasing a dynamic target, which incorporates the detectability objective. The proposed method actively guides the motion of a cinematographer drone so that the color of a target is well-distinguished against the colors of the background in the view of the drone. For the objective, we define a measure of color detectability given a chasing path. After computing a discrete path optimized for the metric, we generate a dynamically feasible trajectory. The whole pipeline can be updated on-the-fly to respond to the motion of the target. For the efficient discrete path generation, we construct a directed acyclic graph (DAG) for which a topological sorting can be determined analytically without the depth-first search. The smooth path is obtained in quadratic programming (QP) framework. We validate the enhanced performance of state-of-the-art object detection and tracking algorithms when the camera drone executes the trajectory obtained from the proposed method.

36.Fundus Image Analysis for Age Related Macular Degeneration: ADAM-2020 Challenge Report ⬇️

Age related macular degeneration (AMD) is one of the major causes for blindness in the elderly population. In this report, we propose deep learning based methods for retinal analysis using color fundus images for computer aided diagnosis of AMD. We leverage the recent state of the art deep networks for building a single fundus image based AMD classification pipeline. We also propose methods for the other directly relevant and auxiliary tasks such as lesions detection and segmentation, fovea detection and optic disc segmentation. We propose the use of generative adversarial networks (GANs) for the tasks of segmentation and detection. We also propose a novel method of fovea detection using GANs.

37.TopoMap: A 0-dimensional Homology Preserving Projection of High-Dimensional Data ⬇️

Multidimensional Projection is a fundamental tool for high-dimensional data analytics and visualization. With very few exceptions, projection techniques are designed to map data from a high-dimensional space to a visual space so as to preserve some dissimilarity (similarity) measure, such as the Euclidean distance for example. In fact, although adopting distinct mathematical formulations designed to favor different aspects of the data, most multidimensional projection methods strive to preserve dissimilarity measures that encapsulate geometric properties such as distances or the proximity relation between data objects. However, geometric relations are not the only interesting property to be preserved in a projection. For instance, the analysis of particular structures such as clusters and outliers could be more reliably performed if the mapping process gives some guarantee as to topological invariants such as connected components and loops. This paper introduces TopoMap, a novel projection technique which provides topological guarantees during the mapping process. In particular, the proposed method performs the mapping from a high-dimensional space to a visual space, while preserving the 0-dimensional persistence diagram of the Rips filtration of the high-dimensional data, ensuring that the filtrations generate the same connected components when applied to the original as well as projected data. The presented case studies show that the topological guarantee provided by TopoMap not only brings confidence to the visual analytic process but also can be used to assist in the assessment of other projection methods.

38.TAP-Net: Transport-and-Pack using Reinforcement Learning ⬇️

We introduce the transport-and-pack(TAP) problem, a frequently encountered instance of real-world packing, and develop a neural optimization solution based on reinforcement learning. Given an initial spatial configuration of boxes, we seek an efficient method to iteratively transport and pack the boxes compactly into a target container. Due to obstruction and accessibility constraints, our problem has to add a new search dimension, i.e., finding an optimal transport sequence, to the already immense search space for packing alone. Using a learning-based approach, a trained network can learn and encode solution patterns to guide the solution of new problem instances instead of executing an expensive online search. In our work, we represent the transport constraints using a precedence graph and train a neural network, coined TAP-Net, using reinforcement learning to reward efficient and stable packing. The network is built on an encoder-decoder architecture, where the encoder employs convolution layers to encode the box geometry and precedence graph and the decoder is a recurrent neural network (RNN) which inputs the current encoder output, as well as the current box packing state of the target container, and outputs the next box to pack, as well as its orientation. We train our network on randomly generated initial box configurations, without supervision, via policy gradients to learn optimal TAP policies to maximize packing efficiency and stability. We demonstrate the performance of TAP-Net on a variety of examples, evaluating the network through ablation studies and comparisons to baselines and alternative network designs. We also show that our network generalizes well to larger problem instances, when trained on small-sized inputs.

39.Dexterous Robotic Grasping with Object-Centric Visual Affordances ⬇️

Dexterous robotic hands are appealing for their agility and human-like morphology, yet their high degree of freedom makes learning to manipulate challenging. We introduce an approach for learning dexterous grasping. Our key idea is to embed an object-centric visual affordance model within a deep reinforcement learning loop to learn grasping policies that favor the same object regions favored by people. Unlike traditional approaches that learn from human demonstration trajectories (e.g., hand joint sequences captured with a glove), the proposed prior is object-centric and image-based, allowing the agent to anticipate useful affordance regions for objects unseen during policy learning. We demonstrate our idea with a 30-DoF five-fingered robotic hand simulator on 40 objects from two datasets, where it successfully and efficiently learns policies for stable grasps. Our affordance-guided policies are significantly more effective, generalize better to novel objects, and train 3 X faster than the baselines. Our work offers a step towards manipulation agents that learn by watching how people use objects, without requiring state and action information about the human body. Project website: this http URL

40.Real Image Super Resolution Via Heterogeneous Model using GP-NAS ⬇️

With advancement in deep neural network (DNN), recent state-of-the-art (SOTA) image superresolution (SR) methods have achieved impressive performance using deep residual network with dense skip connections. While these models perform well on benchmark dataset where low-resolution (LR) images are constructed from high-resolution (HR) references with known blur kernel, real image SR is more challenging when both images in the LR-HR pair are collected from real cameras. Based on existing dense residual networks, a Gaussian process based neural architecture search (GP-NAS) scheme is utilized to find candidate network architectures using a large search space by varying the number of dense residual blocks, the block size and the number of features. A suite of heterogeneous models with diverse network structure and hyperparameter are selected for model-ensemble to achieve outstanding performance in real image SR. The proposed method won the first place in all three tracks of the AIM 2020 Real Image Super-Resolution Challenge.

41.An Internal Cluster Validity Index Based on Distance-based Separability Measure ⬇️

To evaluate clustering results is a significant part in cluster analysis. Usually, there is no true class labels for clustering as a typical unsupervised learning. Thus, a number of internal evaluations, which use predicted labels and data, have been created. They also named internal cluster validity indices (CVIs). Without true labels, to design an effective CVI is not simple because it is similar to create a clustering method. And, to have more CVIs is crucial because there is no universal CVI that can be used to measure all datasets, and no specific method for selecting a proper CVI for clusters without true labels. Therefore, to apply more CVIs to evaluate clustering results is necessary. In this paper, we propose a novel CVI - called Distance-based Separability Index (DSI), based on a data separability measure. We applied the DSI and eight other internal CVIs including early studies from Dunn (1974) to most recent studies CVDD (2019) as comparison. We used an external CVI as ground truth for clustering results of five clustering algorithms on 12 real and 97 synthetic datasets. Results show DSI is an effective, unique, and competitive CVI to other compared CVIs. In addition, we summarized the general process to evaluate CVIs and created a new method - rank difference - to compare the results of CVIs.

42.When Image Decomposition Meets Deep Learning: A Novel Infrared and Visible Image Fusion Method ⬇️

Infrared and visible image fusion, as a hot topic in image processing and image enhancement, aims to produce fused images retaining the detail texture information in visible images and the thermal radiation information in infrared images. In this paper, we propose a novel two-stream auto-encoder (AE) based fusion network. The core idea is that the encoder decomposes an image into base and detail feature maps with low- and high-frequency information, respectively, and that the decoder is responsible for the original image reconstruction. To this end, a well-designed loss function is established to make the base/detail feature maps similar/dissimilar. In the test phase, base and detail feature maps are respectively merged via a fusion module, and the fused image is recovered by the decoder. Qualitative and quantitative results demonstrate that our method can generate fusion images containing highlighted targets and abundant detail texture information with strong reproducibility and meanwhile superior than the state-of-the-art (SOTA) approaches.

Files

20200904.md

Latest commit

History

20200904.md

File metadata and controls

ArXiv cs.CV --Fri, 4 Sep 2020

1.Flow-edge Guided Video Completion ⬇️

2.Computational Analysis of Deformable Manifolds: from Geometric Modelling to Deep Learning ⬇️

3.Synthetic-to-Real Unsupervised Domain Adaptation for Scene Text Detection in the Wild ⬇️

4.MIPGAN -- Generating Robust and High QualityMorph Attacks Using Identity Prior Driven GAN ⬇️

5.Multi-Loss Weighting with Coefficient of Variations ⬇️

6.Future Frame Prediction of a Video Sequence ⬇️

7.Multi-domain semantic segmentation with pyramidal fusion ⬇️

8.Modification method for single-stage object detectors that allows to exploit the temporal behaviour of a scene to improve detection accuracy ⬇️

9.Few-shot Object Detection with Feature Attention Highlight Module in Remote Sensing Images ⬇️

10.SCG-Net: Self-Constructing Graph Neural Networks for Semantic Segmentation ⬇️

11.Layer-specific Optimization for Mixed Data Flow with Mixed Precision in FPGA Design for CNN-based Object Detectors ⬇️

12.DESC: Domain Adaptation for Depth Estimation via Semantic Consistency ⬇️

13.Auto-Classifier: A Robust Defect Detector Based on an AutoML Head ⬇️

14.1st Place Solution of LVIS Challenge 2020: A Good Box is not a Guarantee of a Good Mask ⬇️

15.Physics-based Shading Reconstruction for Intrinsic Image Decomposition ⬇️

16.A Comparison of Pre-trained Vision-and-Language Models for Multimodal Representation Learning across Medical Images and Reports ⬇️

17.TRACE: Transform Aggregate and Compose Visiolinguistic Representations for Image Search with Text Feedback ⬇️

18.Modeling Global Body Configurations in American Sign Language ⬇️

19.Adherent Mist and Raindrop Removal from a Single Image Using Attentive Convolutional Network ⬇️

20.Ref-NMS: Breaking Proposal Bottlenecks in Two-Stage Referring Expression Grounding ⬇️

21.Tasks Integrated Networks: Joint Detection and Retrieval for Image Search ⬇️

22.Spatial Transformer Point Convolution ⬇️

23.Noise-Aware Texture-Preserving Low-Light Enhancement ⬇️

24.Towards Practical Implementations of Person Re-Identification from Full Video Frames ⬇️

25.NITES: A Non-Parametric Interpretable Texture Synthesis Method ⬇️

26.Robust Object Classification Approach using Spherical Harmonics ⬇️

27.Unsupervised Point Cloud Registration via Salient Points Analysis (SPA) ⬇️

28.Unsupervised Feedforward Feature (UFF) Learning for Point Cloud Classification and Segmentation ⬇️

29.Efficiency in Real-time Webcam Gaze Tracking ⬇️

30.CNN-Based Ultrasound Image Reconstruction for Ultrafast Displacement Tracking ⬇️

31.Limited View Tomographic Reconstruction Using a Deep Recurrent Framework with Residual Dense Spatial-Channel Attention Network and Sinogram Consistency ⬇️

32.Software Effort Estimation using parameter tuned Models ⬇️

33.Heightmap Reconstruction of Macula on Color Fundus Images Using Conditional Generative Adversarial Networks ⬇️

34.Multimodal brain tumor classification ⬇️

35.Detection-Aware Trajectory Generation for a Drone Cinematographer ⬇️

36.Fundus Image Analysis for Age Related Macular Degeneration: ADAM-2020 Challenge Report ⬇️

37.TopoMap: A 0-dimensional Homology Preserving Projection of High-Dimensional Data ⬇️

38.TAP-Net: Transport-and-Pack using Reinforcement Learning ⬇️

39.Dexterous Robotic Grasping with Object-Centric Visual Affordances ⬇️

40.Real Image Super Resolution Via Heterogeneous Model using GP-NAS ⬇️

41.An Internal Cluster Validity Index Based on Distance-based Separability Measure ⬇️

42.When Image Decomposition Meets Deep Learning: A Novel Infrared and Visible Image Fusion Method ⬇️