Skip to content

Latest commit

 

History

History
203 lines (203 loc) · 140 KB

20200929.md

File metadata and controls

203 lines (203 loc) · 140 KB

ArXiv cs.CV --Tue, 29 Sep 2020

1.Afro-MNIST: Synthetic generation of MNIST-style datasets for low-resource languages ⬇️

We present Afro-MNIST, a set of synthetic MNIST-style datasets for four orthographies used in Afro-Asiatic and Niger-Congo languages: Ge`ez (Ethiopic), Vai, Osmanya, and N'Ko. These datasets serve as "drop-in" replacements for MNIST. We also describe and open-source a method for synthetic MNIST-style dataset generation from single examples of each digit. These datasets can be found at this https URL. We hope that MNIST-style datasets will be developed for other numeral systems, and that these datasets vitalize machine learning education in underrepresented nations in the research community.

2.A complete character recognition and transliteration technique for Devanagari script ⬇️

Transliteration involves transformation of one script to another based on phonetic similarities between the characters of two distinctive scripts. In this paper, we present a novel technique for automatic transliteration of Devanagari script using character recognition. One of the first tasks performed to isolate the constituent characters is segmentation. Line segmentation methodology in this manuscript discusses the case of overlapping lines. Character segmentation algorithm is designed to segment conjuncts and separate shadow characters. Presented shadow character segmentation scheme employs connected component method to isolate the character, keeping the constituent characters intact. Statistical features namely different order moments like area, variance, skewness and kurtosis along with structural features of characters are employed in two phase recognition process. After recognition, constituent Devanagari characters are mapped to corresponding roman alphabets in way that resulting roman alphabets have similar pronunciation to source characters.

3.ConvSequential-SLAM: A Sequence-based, Training-less Visual Place Recognition Technique for Changing Environments ⬇️

Visual Place Recognition (VPR) is the ability to correctly recall a previously visited place under changing viewpoints and appearances. A large number of handcrafted and deep-learning-based VPR techniques exist, where the former suffer from appearance changes and the latter have significant computational needs. In this paper, we present a new handcrafted VPR technique that achieves state-of-the-art place matching performance under challenging conditions. Our technique combines the best of 2 existing trainingless VPR techniques, SeqSLAM and CoHOG, which are each robust to conditional and viewpoint changes, respectively. This blend, namely ConvSequential-SLAM, utilises sequential information and block-normalisation to handle appearance changes, while using regional-convolutional matching to achieve viewpoint-invariance. We analyse content-overlap in-between query frames to find a minimum sequence length, while also re-using the image entropy information for environment-based sequence length tuning. State-of-the-art performance is reported in contrast to 8 contemporary VPR techniques on 4 public datasets. Qualitative insights and an ablation study on sequence length are also provided.

4.Arabic Handwritten Character Recognition based on Convolution Neural Networks and Support Vector Machine ⬇️

Recognition of Arabic characters is essential for natural language processing and computer vision fields. The need to recognize and classify the handwritten Arabic letters and characters are essentially required. In this paper, we present an algorithm for recognizing Arabic letters and characters based on using deep convolution neural networks (DCNN) and support vector machine (SVM). This paper addresses the problem of recognizing the Arabic handwritten characters by determining the similarity between the input templates and the pre-stored templates using both fully connected DCNN and dropout SVM. Furthermore, this paper determines the correct classification rate (CRR) depends on the accuracy of the corrected classified templates, of the recognized handwritten Arabic characters. Moreover, we determine the error classification rate (ECR). The experimental results of this work indicate the ability of the proposed algorithm to recognize, identify, and verify the input handwritten Arabic characters. Furthermore, the proposed system determines similar Arabic characters using a clustering algorithm based on the K-means clustering approach to handle the problem of multi-stroke in Arabic characters. The comparative evaluation is stated and the system accuracy reached 95.07% CRR with 4.93% ECR compared with the state of the art.

5.Learning to Detect Objects with a 1 Megapixel Event Camera ⬇️

Event cameras encode visual information with high temporal precision, low data-rate, and high-dynamic range. Thanks to these characteristics, event cameras are particularly suited for scenarios with high motion, challenging lighting conditions and requiring low latency. However, due to the novelty of the field, the performance of event-based systems on many vision tasks is still lower compared to conventional frame-based solutions. The main reasons for this performance gap are: the lower spatial resolution of event sensors, compared to frame cameras; the lack of large-scale training datasets; the absence of well established deep learning architectures for event-based processing. In this paper, we address all these problems in the context of an event-based object detection task. First, we publicly release the first high-resolution large-scale dataset for object detection. The dataset contains more than 14 hours recordings of a 1 megapixel event camera, in automotive scenarios, together with 25M bounding boxes of cars, pedestrians, and two-wheelers, labeled at high frequency. Second, we introduce a novel recurrent architecture for event-based detection and a temporal consistency loss for better-behaved training. The ability to compactly represent the sequence of events into the internal memory of the model is essential to achieve high accuracy. Our model outperforms by a large margin feed-forward event-based architectures. Moreover, our method does not require any reconstruction of intensity images from events, showing that training directly from raw events is possible, more efficient, and more accurate than passing through an intermediate intensity image. Experiments on the dataset introduced in this work, for which events and gray level images are available, show performance on par with that of highly tuned and studied frame-based detectors.

6.A Study on Lip Localization Techniques used for Lip reading from a Video ⬇️

In this paper some of the different techniques used to localize the lips from the face are discussed and compared along with its processing steps. Lip localization is the basic step needed to read the lips for extracting visual information from the video input. The techniques could be applied on asymmetric lips and also on the mouth with visible teeth, tongue & mouth with moustache. In the process of Lip reading the following steps are generally used. They are, initially locating lips in the first frame of the video input, then tracking the lips in the following frames using the resulting pixel points of initial step and at last converting the tracked lip model to its corresponding matched letter to give the visual information. A new proposal is also initiated from the discussed techniques. The lip reading is useful in Automatic Speech Recognition when the audio is absent or present low with or without noise in the communication systems. Human Computer communication also will require speech recognition.

7.CAT STREET: Chronicle Archive of Tokyo Street-fashion ⬇️

The analysis of daily life fashion trends can help us understand our societies and human cultures profoundly. However, no appropriate database exists that includes images illustrating what people wore in their daily lives over an extended period. In this study, we propose a new fashion image archive, Chronicle Archive of Tokyo Street-fashion (CAT STREET), to shed light on daily life fashion trends. CAT STREET includes images showing what people wore in their daily lives during the period 1970-2017, and these images contain timestamps and street location annotations. This novel database enables us to observe long-term daily life fashion trends using quantitative methods. To evaluate the potential of our database, we corroborated the rules-of-thumb for two fashion trend phenomena, namely how economic conditions affect fashion style share in the long term and how fashion styles emerge in the street and diffuse from street to street. Our findings show that the Conservative style trend, a type of luxury fashion style, is affected by economic conditions. We also introduce four cases of how fashion styles emerge in the street and diffuse from street to street in fashion-conscious streets in Tokyo. Our study demonstrates CAT STREET's potential to promote understanding of societies and human cultures through quantitative analysis of daily life fashion trends.

8.RS-MetaNet: Deep meta metric learning for few-shot remote sensing scene classification ⬇️

Training a modern deep neural network on massive labeled samples is the main paradigm in solving the scene classification problem for remote sensing, but learning from only a few data points remains a challenge. Existing methods for few-shot remote sensing scene classification are performed in a sample-level manner, resulting in easy overfitting of learned features to individual samples and inadequate generalization of learned category segmentation surfaces. To solve this problem, learning should be organized at the task level rather than the sample level. Learning on tasks sampled from a task family can help tune learning algorithms to perform well on new tasks sampled in that family. Therefore, we propose a simple but effective method, called RS-MetaNet, to resolve the issues related to few-shot remote sensing scene classification in the real world. On the one hand, RS-MetaNet raises the level of learning from the sample to the task by organizing training in a meta way, and it learns to learn a metric space that can well classify remote sensing scenes from a series of tasks. We also propose a new loss function, called Balance Loss, which maximizes the generalization ability of the model to new samples by maximizing the distance between different categories, providing the scenes in different categories with better linear segmentation planes while ensuring model fit. The experimental results on three open and challenging remote sensing datasets, UCMerced_LandUse, NWPU-RESISC45, and Aerial Image Data, demonstrate that our proposed RS-MetaNet method achieves state-of-the-art results in cases where there are only 1-20 labeled samples.

9.Learning Category- and Instance-Aware Pixel Embedding for Fast Panoptic Segmentation ⬇️

Panoptic segmentation (PS) is a complex scene understanding task that requires providing high-quality segmentation for both thing objects and stuff regions. Previous methods handle these two classes with semantic and instance segmentation modules separately, following with heuristic fusion or additional modules to resolve the conflicts between the two outputs. This work simplifies this pipeline of PS by consistently modeling the two classes with a novel PS framework, which extends a detection model with an extra module to predict category- and instance-aware pixel embedding (CIAE). CIAE is a novel pixel-wise embedding feature that encodes both semantic-classification and instance-distinction information. At the inference process, PS results are simply derived by assigning each pixel to a detected instance or a stuff class according to the learned embedding. Our method not only shows fast inference speed but also the first one-stage method to achieve comparable performance to two-stage methods on the challenging COCO benchmark.

10.Weakly Supervised Deep Functional Map for Shape Matching ⬇️

A variety of deep functional maps have been proposed recently, from fully supervised to totally unsupervised, with a range of loss functions as well as different regularization terms. However, it is still not clear what are minimum ingredients of a deep functional map pipeline and whether such ingredients unify or generalize all recent work on deep functional maps. We show empirically minimum components for obtaining state of the art results with different loss functions, supervised as well as unsupervised. Furthermore, we propose a novel framework designed for both full-to-full as well as partial to full shape matching that achieves state of the art results on several benchmark datasets outperforming even the fully supervised methods by a significant margin. Our code is publicly available at this https URL

11.Addressing Class Imbalance in Scene Graph Parsing by Learning to Contrast and Score ⬇️

Scene graph parsing aims to detect objects in an image scene and recognize their relations. Recent approaches have achieved high average scores on some popular benchmarks, but fail in detecting rare relations, as the highly long-tailed distribution of data biases the learning towards frequent labels. Motivated by the fact that detecting these rare relations can be critical in real-world applications, this paper introduces a novel integrated framework of classification and ranking to resolve the class imbalance problem in scene graph parsing. Specifically, we design a new Contrasting Cross-Entropy loss, which promotes the detection of rare relations by suppressing incorrect frequent ones. Furthermore, we propose a novel scoring module, termed as Scorer, which learns to rank the relations based on the image features and relation features to improve the recall of predictions. Our framework is simple and effective, and can be incorporated into current scene graph models. Experimental results show that the proposed approach improves the current state-of-the-art methods, with a clear advantage of detecting rare relations.

12.EvolGAN: Evolutionary Generative Adversarial Networks ⬇️

We propose to use a quality estimator and evolutionary methods to search the latent space of generative adversarial networks trained on small, difficult datasets, or both. The new method leads to the generation of significantly higher quality images while preserving the original generator's diversity. Human raters preferred an image from the new version with frequency 83.7pc for Cats, 74pc for FashionGen, 70.4pc for Horses, and 69.2pc for Artworks, and minor improvements for the already excellent GANs for faces. This approach applies to any quality scorer and GAN generator.

13.Cuid: A new study of perceived image quality and its subjective assessment ⬇️

Research on image quality assessment (IQA) remains limited mainly due to our incomplete knowledge about human visual perception. Existing IQA algorithms have been designed or trained with insufficient subjective data with a small degree of stimulus variability. This has led to challenges for those algorithms to handle complexity and diversity of real-world digital content. Perceptual evidence from human subjects serves as a grounding for the development of advanced IQA algorithms. It is thus critical to acquire reliable subjective data with controlled perception experiments that faithfully reflect human behavioural responses to distortions in visual signals. In this paper, we present a new study of image quality perception where subjective ratings were collected in a controlled lab environment. We investigate how quality perception is affected by a combination of different categories of images and different types and levels of distortions. The database will be made publicly available to facilitate calibration and validation of IQA algorithms.

14.The Elements of End-to-end Deep Face Recognition: A Survey of Recent Advances ⬇️

Face recognition is one of the most fundamental and long-standing topics in computer vision community. With the recent developments of deep convolutional neural networks and large-scale datasets, deep face recognition has made remarkable progress and been widely used in the real-world applications. Given a natural image or video frame as input, an end-to-end deep face recognition system outputs the face feature for recognition. To achieve this, the whole system is generally built with three key elements: face detection, face preprocessing, and face representation. The face detection locates faces in the image or frame. Then, the face preprocessing is proceeded to calibrate the faces to a canonical view and crop them to a normalized pixel size. Finally, in the stage of face representation, the discriminative features are extracted from the preprocessed faces for recognition. All of the three elements are fulfilled by deep convolutional neural networks. In this paper, we present a comprehensive survey about the recent advances of every element of the end-to-end deep face recognition, since the thriving deep learning techniques have greatly improved the capability of them. To start with, we introduce an overview of the end-to-end deep face recognition, which, as mentioned above, includes face detection, face preprocessing, and face representation. Then, we review the deep learning based advances of each element, respectively, covering many aspects such as the up-to-date algorithm designs, evaluation metrics, datasets, performance comparison, existing challenges, and promising directions for future research. We hope this survey could bring helpful thoughts to one for better understanding of the big picture of end-to-end face recognition and deeper exploration in a systematic way.

15.Multi-scale Receptive Fields Graph Attention Network for Point Cloud Classification ⬇️

Understanding the implication of point cloud is still challenging to achieve the goal of classification or segmentation due to the irregular and sparse structure of point cloud. As we have known, PointNet architecture as a ground-breaking work for point cloud which can learn efficiently shape features directly on unordered 3D point cloud and have achieved favorable performance. However, this model fail to consider the fine-grained semantic information of local structure for point cloud. Afterwards, many valuable works are proposed to enhance the performance of PointNet by means of semantic features of local patch for point cloud. In this paper, a multi-scale receptive fields graph attention network (named after MRFGAT) for point cloud classification is proposed. By focusing on the local fine features of point cloud and applying multi attention modules based on channel affinity, the learned feature map for our network can well capture the abundant features information of point cloud. The proposed MRFGAT architecture is tested on ModelNet10 and ModelNet40 datasets, and results show it achieves state-of-the-art performance in shape classification tasks.

16.Learning to Adapt Multi-View Stereo by Self-Supervision ⬇️

3D scene reconstruction from multiple views is an important classical problem in computer vision. Deep learning based approaches have recently demonstrated impressive reconstruction results. When training such models, self-supervised methods are favourable since they do not rely on ground truth data which would be needed for supervised training and is often difficult to obtain. Moreover, learned multi-view stereo reconstruction is prone to environment changes and should robustly generalise to different domains. We propose an adaptive learning approach for multi-view stereo which trains a deep neural network for improved adaptability to new target domains. We use model-agnostic meta-learning (MAML) to train base parameters which, in turn, are adapted for multi-view stereo on new domains through self-supervised training. Our evaluations demonstrate that the proposed adaptation method is effective in learning self-supervised multi-view stereo reconstruction in new domains.

17.Driver Drowsiness Classification Based on Eye Blink and Head Movement Features Using the k-NN Algorithm ⬇️

Modern advanced driver-assistance systems analyze the driving performance to gather information about the driver's state. Such systems are able, for example, to detect signs of drowsiness by evaluating the steering or lane keeping behavior and to alert the driver when the drowsiness state reaches a critical level. However, these kinds of systems have no access to direct cues about the driver's state. Hence, the aim of this work is to extend the driver drowsiness detection in vehicles using signals of a driver monitoring camera. For this purpose, 35 features related to the driver's eye blinking behavior and head movements are extracted in driving simulator experiments. Based on that large dataset, we developed and evaluated a feature selection method based on the k-Nearest Neighbor algorithm for the driver's state classification. A concluding analysis of the best performing feature sets yields valuable insights about the influence of drowsiness on the driver's blink behavior and head movements. These findings will help in the future development of robust and reliable driver drowsiness monitoring systems to prevent fatigue-induced accidents.

18.Texture Memory-Augmented Deep Patch-Based Image Inpainting ⬇️

Patch-based methods and deep networks have been employed to tackle image inpainting problem, with their own strengths and weaknesses. Patch-based methods are capable of restoring a missing region with high-quality texture through searching nearest neighbor patches from the unmasked regions. However, these methods bring problematic contents when recovering large missing regions. Deep networks, on the other hand, show promising results in completing large regions. Nonetheless, the results often lack faithful and sharp details that resemble the surrounding area. By bringing together the best of both paradigms, we propose a new deep inpainting framework where texture generation is guided by a texture memory of patch samples extracted from unmasked regions. The framework has a novel design that allows texture memory retrieval to be trained end-to-end with the deep inpainting network. In addition, we introduce a patch distribution loss to encourage high-quality patch synthesis. The proposed method shows superior performance both qualitatively and quantitatively on three challenging image benchmarks, i.e., Places, CelebA-HQ, and Paris Street-View datasets.

19.Video Face Recognition System: RetinaFace-mnet-faster and Secondary Search ⬇️

Face recognition is widely used in the scene. However, different visual environments require different methods, and face recognition has a difficulty in complex environments. Therefore, this paper mainly experiments complex faces in the video. First, we design an image pre-processing module for fuzzy scene or under-exposed faces to enhance images. Our experimental results demonstrate that effective images pre-processing improves the accuracy of 0.11%, 0.2% and 1.4% on LFW, WIDER FACE and our datasets, respectively. Second, we propose RetinacFace-mnet-faster for detection and a confidence threshold specification for face recognition, reducing the lost rate. Our experimental results show that our RetinaFace-mnet-faster for 640*480 resolution on the Tesla P40 and single-thread improve speed of 16.7% and 70.2%, respectively. Finally, we design secondary search mechanism with HNSW to improve performance. Ours is suitable for large-scale datasets, and experimental results show that our method is 82% faster than the violent retrieval for the single-frame detection.

20.Trainable Structure Tensors for Autonomous Baggage Threat Detection Under Extreme Occlusion ⬇️

Detecting baggage threats is one of the most difficult tasks, even for expert officers. Many researchers have developed computer-aided screening systems to recognize these threats from the baggage X-ray scans. However, all of these frameworks are limited in recognizing contraband items under extreme occlusion. This paper presents a novel instance detector that utilizes a trainable structure tensor scheme to highlight the contours of the occluded and cluttered contraband items (obtained from multiple predominant orientations) while simultaneously suppressing all the other baggage content within the scan, leading to robust detection. The proposed framework has been rigorously tested on four publicly available X-ray datasets where it outperforms the state-of-the-art frameworks in terms of mean average precision scores. Furthermore, to the best of our knowledge, it is the only framework that has been rigorously tested on combined grayscale and colored scans obtained from four different types of X-ray scanners.

21.Segmentation and Analysis of a Sketched Truss Frame Using Morphological Image Processing Techniques ⬇️

Development of computational tools to analyze and assess the building capacities has had a major impact in civil engineering. The interaction with the structural software packages is becoming easier and the modeling tools are becoming smarter by automating the users role during their interaction with the software. One of the difficulties and the most time consuming steps involved in the structural modeling is defining the geometry of the structure to provide the analysis. This paper is dedicated to the development of a methodology to automate analysis of a hand sketched or computer generated truss frame drawn on a piece of paper. First, we focus on the segmentation methodologies for hand sketched truss components using the morphological image processing techniques, and then we provide a real time analysis of the truss. We visualize and augment the results on the input image to facilitate the public understanding of the truss geometry and internal forces. MATLAB is used as the programming language for the image processing purposes, and the truss is analyzed using Sap2000 API to integrate with MATLAB to provide a convenient structural analysis. This paper highlights the potential of the automation of the structural analysis using image processing to quickly assess the efficiency of structural systems. Further development of this framework is likely to revolutionize the way that structures are modeled and analyzed.

22.Interpretable Detail-Fidelity Attention Network for Single Image Super-Resolution ⬇️

Benefiting from the strong capabilities of deep CNNs for feature representation and nonlinear mapping, deep-learning-based methods have achieved excellent performance in single image super-resolution. However, most existing SR methods depend on the high capacity of networks which is initially designed for visual recognition, and rarely consider the initial intention of super-resolution for detail fidelity. Aiming at pursuing this intention, there are two challenging issues to be solved: (1) learning appropriate operators which is adaptive to the diverse characteristics of smoothes and details; (2) improving the ability of model to preserve the low-frequency smoothes and reconstruct the high-frequency details. To solve them, we propose a purposeful and interpretable detail-fidelity attention network to progressively process these smoothes and details in divide-and-conquer manner, which is a novel and specific prospect of image super-resolution for the purpose on improving the detail fidelity, instead of blindly designing or employing the deep CNNs architectures for merely feature representation in local receptive fields. Particularly, we propose a Hessian filtering for interpretable feature representation which is high-profile for detail inference, a dilated encoder-decoder and a distribution alignment cell to improve the inferred Hessian features in morphological manner and statistical manner respectively. Extensive experiments demonstrate that the proposed methods achieve superior performances over the state-of-the-art methods quantitatively and qualitatively. Code is available at this https URL.

23.RRPN++: Guidance Towards More Accurate Scene Text Detection ⬇️

RRPN is among the outstanding scene text detection approaches, but the manually-designed anchor and coarse proposal refinement make the performance still far from perfection. In this paper, we propose RRPN++ to exploit the potential of RRPN-based model by several improvements. Based on RRPN, we propose the Anchor-free Pyramid Proposal Networks (APPN) to generate first-stage proposals, which adopts the anchor-free design to reduce proposal number and accelerate the inference speed. In our second stage, both the detection branch and the recognition branch are incorporated to perform multi-task learning. In inference stage, the detection branch outputs the proposal refinement and the recognition branch predicts the transcript of the refined text region. Further, the recognition branch also helps rescore the proposals and eliminate the false positive proposals by the jointing filtering strategy. With these enhancements, we boost the detection results by $6%$ of F-measure in ICDAR2015 compared to RRPN. Experiments conducted on other benchmarks also illustrate the superior performance and efficiency of our model.

24.Learning to Stop: A Simple yet Effective Approach to Urban Vision-Language Navigation ⬇️

Vision-and-Language Navigation (VLN) is a natural language grounding task where an agent learns to follow language instructions and navigate to specified destinations in real-world environments. A key challenge is to recognize and stop at the correct location, especially for complicated outdoor environments. Existing methods treat the STOP action equally as other actions, which results in undesirable behaviors that the agent often fails to stop at the destination even though it might be on the right path. Therefore, we propose Learning to Stop (L2Stop), a simple yet effective policy module that differentiates STOP and other actions. Our approach achieves the new state of the art on a challenging urban VLN dataset Touchdown, outperforming the baseline by 6.89% (absolute improvement) on Success weighted by Edit Distance (SED).

25.NITI: Training Integer Neural Networks Using Integer-only Arithmetic ⬇️

While integer arithmetic has been widely adopted for improved performance in deep quantized neural network inference, training remains a task primarily executed using floating point arithmetic. This is because both high dynamic range and numerical accuracy are central to the success of most modern training algorithms. However, due to its potential for computational, storage and energy advantages in hardware accelerators, neural network training methods that can be implemented with low precision integer-only arithmetic remains an active research challenge. In this paper, we present NITI, an efficient deep neural network training framework that stores all parameters and intermediate values as integers, and computes exclusively with integer arithmetic. A pseudo stochastic rounding scheme that eliminates the need for external random number generation is proposed to facilitate conversion from wider intermediate results to low precision storage. Furthermore, a cross-entropy loss backpropagation scheme computed with integer-only arithmetic is proposed. A proof-of-concept open-source software implementation of NITI that utilizes native 8-bit integer operations in modern GPUs to achieve end-to-end training is presented. When compared with an equivalent training setup implemented with floating point storage and arithmetic, NITI achieves negligible accuracy degradation on the MNIST and CIFAR10 datasets using 8-bit integer storage and computation. On ImageNet, 16-bit integers are needed for weight accumulation with an 8-bit datapath. This achieves training results comparable to all-floating-point implementations.

26.PERF-Net: Pose Empowered RGB-Flow Net ⬇️

In recent years, many works in the video action recognition literature have shown that two stream models (combining spatial and temporal input streams) are necessary for achieving state of the art performance. In this paper we show the benefits of including yet another stream based on human pose estimated from each frame -- specifically by rendering pose on input RGB frames. At first blush, this additional stream may seem redundant given that human pose is fully determined by RGB pixel values -- however we show (perhaps surprisingly) that this simple and flexible addition can provide complementary gains. Using this insight, we then propose a new model, which we dub PERF-Net (short for Pose Empowered RGB-Flow Net), which combines this new pose stream with the standard RGB and flow based input streams via distillation techniques and show that our model outperforms the state-of-the-art by a large margin in a number of human action recognition datasets while not requiring flow or pose to be explicitly computed at inference time.

27.Distribution Matching for Crowd Counting ⬇️

In crowd counting, each training image contains multiple people, where each person is annotated by a dot. Existing crowd counting methods need to use a Gaussian to smooth each annotated dot or to estimate the likelihood of every pixel given the annotated point. In this paper, we show that imposing Gaussians to annotations hurts generalization performance. Instead, we propose to use Distribution Matching for crowd COUNTing (DM-Count). In DM-Count, we use Optimal Transport (OT) to measure the similarity between the normalized predicted density map and the normalized ground truth density map. To stabilize OT computation, we include a Total Variation loss in our model. We show that the generalization error bound of DM-Count is tighter than that of the Gaussian smoothed methods. In terms of Mean Absolute Error, DM-Count outperforms the previous state-of-the-art methods by a large margin on two large-scale counting datasets, UCF-QNRF and NWPU, and achieves the state-of-the-art results on the ShanghaiTech and UCF-CC50 datasets. Notably, DM-Count ranked first on the leaderboard for the NWPU benchmark, reducing the error of the state-of-the-art published result by approximately 16%. Code is available at this https URL.

28.Semi-Supervised Image Deraining using Gaussian Processes ⬇️

Recent CNN-based methods for image deraining have achieved excellent performance in terms of reconstruction error as well as visual quality. However, these methods are limited in the sense that they can be trained only on fully labeled data. Due to various challenges in obtaining real world fully-labeled image deraining datasets, existing methods are trained only on synthetically generated data and hence, generalize poorly to real-world images. The use of real-world data in training image deraining networks is relatively less explored in the literature. We propose a Gaussian Process-based semi-supervised learning framework which enables the network in learning to derain using synthetic dataset while generalizing better using unlabeled real-world images. More specifically, we model the latent space vectors of unlabeled data using Gaussian Processes, which is then used to compute pseudo-ground-truth for supervising the network on unlabeled data. Through extensive experiments and ablations on several challenging datasets (such as Rain800, Rain200L and DDN-SIRR), we show that the proposed method is able to effectively leverage unlabeled data thereby resulting in significantly better performance as compared to labeled-only training. Additionally, we demonstrate that using unlabeled real-world images in the proposed GP-based framework results

29.Rotated Binary Neural Network ⬇️

Binary Neural Network (BNN) shows its predominance in reducing the complexity of deep neural networks. However, it suffers severe performance degradation. One of the major impediments is the large quantization error between the full-precision weight vector and its binary vector. Previous works focus on compensating for the norm gap while leaving the angular bias hardly touched. In this paper, for the first time, we explore the influence of angular bias on the quantization error and then introduce a Rotated Binary Neural Network (RBNN), which considers the angle alignment between the full-precision weight vector and its binarized version. At the beginning of each training epoch, we propose to rotate the full-precision weight vector to its binary vector to reduce the angular bias. To avoid the high complexity of learning a large rotation matrix, we further introduce a bi-rotation formulation that learns two smaller rotation matrices. In the training stage, we devise an adjustable rotated weight vector for binarization to escape the potential local optimum. Our rotation leads to around 50% weight flips which maximize the information gain. Finally, we propose a training-aware approximation of the sign function for the gradient backward. Experiments on CIFAR-10 and ImageNet demonstrate the superiorities of RBNN over many state-of-the-arts. Our source code, experimental settings, training logs and binary models are available at this https URL.

30.Event-based Action Recognition Using Timestamp Image Encoding Network ⬇️

Event camera is an asynchronous, high frequency vision sensor with low power consumption, which is suitable for human action recognition task. It is vital to encode the spatial-temporal information of event data properly and use standard computer vision tool to learn from the data. In this work, we propose a timestamp image encoding 2D network, which takes the encoded spatial-temporal images of the event data as input and output the action label. Experiment results show that our method can achieve the same level of performance as those RGB-based benchmarks on real world action recognition, and also achieve the SOTA result on gesture recognition.

31.Kernel Based Progressive Distillation for Adder Neural Networks ⬇️

Adder Neural Networks (ANNs) which only contain additions bring us a new way of developing deep neural networks with low energy consumption. Unfortunately, there is an accuracy drop when replacing all convolution filters by adder filters. The main reason here is the optimization difficulty of ANNs using $\ell_1$-norm, in which the estimation of gradient in back propagation is inaccurate. In this paper, we present a novel method for further improving the performance of ANNs without increasing the trainable parameters via a progressive kernel based knowledge distillation (PKKD) method. A convolutional neural network (CNN) with the same architecture is simultaneously initialized and trained as a teacher network, features and weights of ANN and CNN will be transformed to a new space to eliminate the accuracy drop. The similarity is conducted in a higher-dimensional space to disentangle the difference of their distributions using a kernel based method. Finally, the desired ANN is learned based on the information from both the ground-truth and teacher, progressively. The effectiveness of the proposed method for learning ANN with higher performance is then well-verified on several benchmarks. For instance, the ANN-50 trained using the proposed PKKD method obtains a 76.8% top-1 accuracy on ImageNet dataset, which is 0.6% higher than that of the ResNet-50.

32.Concentrated Multi-Grained Multi-Attention Network for Video Based Person Re-Identification ⬇️

Occlusion is still a severe problem in the video-based Re-IDentification (Re-ID) task, which has a great impact on the success rate. The attention mechanism has been proved to be helpful in solving the occlusion problem by a large number of existing methods. However, their attention mechanisms still lack the capability to extract sufficient discriminative information into the final representations from the videos. The single attention module scheme employed by existing methods cannot exploit multi-scale spatial cues, and the attention of the single module will be dispersed by multiple salient parts of the person. In this paper, we propose a Concentrated Multi-grained Multi-Attention Network (CMMANet) where two multi-attention modules are designed to extract multi-grained information through processing multi-scale intermediate features. Furthermore, multiple attention submodules in each multi-attention module can automatically discover multiple discriminative regions of the video frames. To achieve this goal, we introduce a diversity loss to diversify the submodules in each multi-attention module, and a concentration loss to integrate their attention responses so that each submodule can strongly focus on a specific meaningful part. The experimental results show that the proposed approach outperforms the state-of-the-art methods by large margins on multiple public datasets.

33.Long-Tailed Classification by Keeping the Good and Removing the Bad Momentum Causal Effect ⬇️

As the class size grows, maintaining a balanced dataset across many classes is challenging because the data are long-tailed in nature; it is even impossible when the sample-of-interest co-exists with each other in one collectable unit, e.g., multiple visual instances in one image. Therefore, long-tailed classification is the key to deep learning at scale. However, existing methods are mainly based on re-weighting/re-sampling heuristics that lack a fundamental theory. In this paper, we establish a causal inference framework, which not only unravels the whys of previous methods, but also derives a new principled solution. Specifically, our theory shows that the SGD momentum is essentially a confounder in long-tailed classification. On one hand, it has a harmful causal effect that misleads the tail prediction biased towards the head. On the other hand, its induced mediation also benefits the representation learning and head prediction. Our framework elegantly disentangles the paradoxical effects of the momentum, by pursuing the direct causal effect caused by an input sample. In particular, we use causal intervention in training, and counterfactual reasoning in inference, to remove the "bad" while keep the "good". We achieve new state-of-the-arts on three long-tailed visual recognition benchmarks: Long-tailed CIFAR-10/-100, ImageNet-LT for image classification and LVIS for instance segmentation.

34.AIM 2020 Challenge on Video Temporal Super-Resolution ⬇️

Videos in the real-world contain various dynamics and motions that may look unnaturally discontinuous in time when the recordedframe rate is low. This paper reports the second AIM challenge on Video Temporal Super-Resolution (VTSR), a.k.a. frame interpolation, with a focus on the proposed solutions, results, and analysis. From low-frame-rate (15 fps) videos, the challenge participants are required to submit higher-frame-rate (30 and 60 fps) sequences by estimating temporally intermediate frames. To simulate realistic and challenging dynamics in the real-world, we employ the REDS_VTSR dataset derived from diverse videos captured in a hand-held camera for training and evaluation purposes. There have been 68 registered participants in the competition, and 5 teams (one withdrawn) have competed in the final testing phase. The winning team proposes the enhanced quadratic video interpolation method and achieves state-of-the-art on the VTSR task.

35.Recognition and Synthesis of Object Transport Motion ⬇️

Deep learning typically requires vast numbers of training examples in order to be used successfully. Conversely, motion capture data is often expensive to generate, requiring specialist equipment, along with actors to generate the prescribed motions, meaning that motion capture datasets tend to be relatively small. Motion capture data does however provide a rich source of information that is becoming increasingly useful in a wide variety of applications, from gesture recognition in human-robot interaction, to data driven animation.
This project illustrates how deep convolutional networks can be used, alongside specialized data augmentation techniques, on a small motion capture dataset to learn detailed information from sequences of a specific type of motion (object transport). The project shows how these same augmentation techniques can be scaled up for use in the more complex task of motion synthesis.
By exploring recent developments in the concept of Generative Adversarial Models (GANs), specifically the Wasserstein GAN, this project outlines a model that is able to successfully generate lifelike object transportation motions, with the generated samples displaying varying styles and transport strategies.

36.Human-Object Interaction Detection:A Quick Survey and Examination of Methods ⬇️

Human-object interaction detection is a relatively new task in the world of computer vision and visual semantic information extraction. With the goal of machines identifying interactions that humans perform on objects, there are many real-world use cases for the research in this field. To our knowledge, this is the first general survey of the state-of-the-art and milestone works in this field. We provide a basic survey of the developments in the field of human-object interaction detection. Many works in this field use multi-stream convolutional neural network architectures, which combine features from multiple sources in the input image. Most commonly these are the humans and objects in question, as well as the spatial quality of the two. As far as we are aware, there have not been in-depth studies performed that look into the performance of each component individually. In order to provide insight to future researchers, we perform an individualized study that examines the performance of each component of a multi-stream convolutional neural network architecture for human-object interaction detection. Specifically, we examine the HORCNN architecture as it is a foundational work in the field. In addition, we provide an in-depth look at the HICO-DET dataset, a popular benchmark in the field of human-object interaction detection. Code and papers can be found at this https URL.

37.A Survey on Deep Learning Methods for Semantic Image Segmentation in Real-Time ⬇️

Semantic image segmentation is one of fastest growing areas in computer vision with a variety of applications. In many areas, such as robotics and autonomous vehicles, semantic image segmentation is crucial, since it provides the necessary context for actions to be taken based on a scene understanding at the pixel level. Moreover, the success of medical diagnosis and treatment relies on the extremely accurate understanding of the data under consideration and semantic image segmentation is one of the important tools in many cases. Recent developments in deep learning have provided a host of tools to tackle this problem efficiently and with increased accuracy. This work provides a comprehensive analysis of state-of-the-art deep learning architectures in image segmentation and, more importantly, an extensive list of techniques to achieve fast inference and computational efficiency. The origins of these techniques as well as their strengths and trade-offs are discussed with an in-depth analysis of their impact in the area. The best-performing architectures are summarized with a list of methods used to achieve these state-of-the-art results.

38.Two-stream Encoder-Decoder Network for Localizing Image Forgeries ⬇️

This paper proposes a novel two-stream encoder-decoder network, which utilizes both the high-level and the low-level image features for precisely localizing forged regions in a manipulated image. This is motivated from the fact that the forgery creation process generally introduces both the high-level artefacts (e.g. unnatural contrast) and the low-level artefacts (e.g. noise inconsistency) to the forged images. In the proposed two-stream network, one stream learns the low-level manipulation-related features in the encoder side by extracting noise residuals through a set of high-pass filters in the first layer of the encoder network. In the second stream, the encoder learns the high-level image manipulation features from the input image RGB values. The coarse feature maps of both the encoders are upsampled by their corresponding decoder network to produce dense feature maps. The dense feature maps of the two streams are concatenated and fed to a final convolutional layer with sigmoidal activation to produce pixel-wise prediction. We have carried out experimental analysis on multiple standard forensics datasets to evaluate the performance of the proposed method. The experimental results show the efficacy of the proposed method with respect to the state-of-the-art.

39.Virtual Experience to Real World Application: Sidewalk Obstacle Avoidance Using Reinforcement Learning for Visually Impaired ⬇️

Finding a path free from obstacles that poses minimal risk is critical for safe navigation. People who are sighted and people who are visually impaired require navigation safety while walking on a sidewalk. In this research we developed an assistive navigation on a sidewalk by integrating sensory inputs using reinforcement learning. We trained a Sidewalk Obstacle Avoidance Agent (SOAA) through reinforcement learning in a simulated robotic environment. A Sidewalk Obstacle Conversational Agent (SOCA) is built by training a natural language conversation agent with real conversation data. The SOAA along with SOCA was integrated in a prototype device called augmented guide (AG). Empirical analysis showed that this prototype improved the obstacle avoidance experience about 5% from a base case of 81.29%

40.Adaptive confidence thresholding for semi-supervised monocular depth estimation ⬇️

Self-supervised monocular depth estimation has become an appealing solution to the lack of ground truth labels, but its reconstruction loss often produces over-smoothed results across object boundaries and is incapable of handling occlusion explicitly. In this paper, we propose a new approach to leverage pseudo ground truth depth maps of stereo images generated from pretrained stereo matching methods. Our method is comprised of three subnetworks; monocular depth network, confidence network, and threshold network. The confidence map of the pseudo ground truth depth map is first estimated to mitigate performance degeneration by inaccurate pseudo depth maps. To cope with the prediction error of the confidence map itself, we also propose to leverage the threshold network that learns the threshold {\tau} in an adaptive manner. The confidence map is thresholded via a differentiable soft-thresholding operator using this truncation boundary {\tau}. The pseudo depth labels filtered out by the thresholded confidence map are finally used to supervise the monocular depth network. To apply the proposed method to various training dataset, we introduce the network-wise training strategy that transfers the knowledge learned from one dataset to another. Experimental results demonstrate superior performance to state-of-the-art monocular depth estimation methods. Lastly, we exhibit that the threshold network can also be used to improve the performance of existing confidence estimation approaches.

41.Domain Generalization for Medical Imaging Classification with Linear-Dependency Regularization ⬇️

Recently, we have witnessed great progress in the field of medical imaging classification by adopting deep neural networks. However, the recent advanced models still require accessing sufficiently large and representative datasets for training, which is often unfeasible in clinically realistic environments. When trained on limited datasets, the deep neural network is lack of generalization capability, as the trained deep neural network on data within a certain distribution (e.g. the data captured by a certain device vendor or patient population) may not be able to generalize to the data with another distribution.
In this paper, we introduce a simple but effective approach to improve the generalization capability of deep neural networks in the field of medical imaging classification. Motivated by the observation that the domain variability of the medical images is to some extent compact, we propose to learn a representative feature space through variational encoding with a novel linear-dependency regularization term to capture the shareable information among medical data collected from different domains. As a result, the trained neural network is expected to equip with better generalization capability to the "unseen" medical data. Experimental results on two challenging medical imaging classification tasks indicate that our method can achieve better cross-domain generalization capability compared with state-of-the-art baselines.

42.AIM 2020: Scene Relighting and Illumination Estimation Challenge ⬇️

We review the AIM 2020 challenge on virtual image relighting and illumination estimation. This paper presents the novel VIDIT dataset used in the challenge and the different proposed solutions and final evaluation results over the 3 challenge tracks. The first track considered one-to-one relighting; the objective was to relight an input photo of a scene with a different color temperature and illuminant orientation (i.e., light source position). The goal of the second track was to estimate illumination settings, namely the color temperature and orientation, from a given image. Lastly, the third track dealt with any-to-any relighting, thus a generalization of the first track. The target color temperature and orientation, rather than being pre-determined, are instead given by a guide image. Participants were allowed to make use of their track 1 and 2 solutions for track 3. The tracks had 94, 52, and 56 registered participants, respectively, leading to 20 confirmed submissions in the final competition stage.

43.Semi-Supervised Learning for In-Game Expert-Level Music-to-Dance Translation ⬇️

Music-to-dance translation is a brand-new and powerful feature in recent role-playing games. Players can now let their characters dance along with specified music clips and even generate fan-made dance videos. Previous works of this topic consider music-to-dance as a supervised motion generation problem based on time-series data. However, these methods suffer from limited training data pairs and the degradation of movements. This paper provides a new perspective for this task where we re-formulate the translation problem as a piece-wise dance phrase retrieval problem based on the choreography theory. With such a design, players are allowed to further edit the dance movements on top of our generation while other regression based methods ignore such user interactivity. Considering that the dance motion capture is an expensive and time-consuming procedure which requires the assistance of professional dancers, we train our method under a semi-supervised learning framework with a large unlabeled dataset (20x than labeled data) collected. A co-ascent mechanism is introduced to improve the robustness of our network. Using this unlabeled dataset, we also introduce self-supervised pre-training so that the translator can understand the melody, rhythm, and other components of music phrases. We show that the pre-training significantly improves the translation accuracy than that of training from scratch. Experimental results suggest that our method not only generalizes well over various styles of music but also succeeds in expert-level choreography for game players.

44.Handwriting Prediction Considering Inter-Class Bifurcation Structures ⬇️

Temporal prediction is a still difficult task due to the chaotic behavior, non-Markovian characteristics, and non-stationary noise of temporal signals. Handwriting prediction is also challenging because of uncertainty arising from inter-class bifurcation structures, in addition to the above problems. For example, the classes '0' and '6' are very similar in terms of their beginning parts; therefore it is nearly impossible to predict their subsequent parts from the beginning part. In other words, '0' and '6' have a bifurcation structure due to ambiguity between classes, and we cannot make a long-term prediction in this context. In this paper, we propose a temporal prediction model that can deal with this bifurcation structure. Specifically, the proposed model learns the bifurcation structure explicitly as a Gaussian mixture model (GMM) for each class as well as the posterior probability of the classes. The final result of prediction is represented as the weighted sum of GMMs using the class probabilities as weights. When multiple classes have large weights, the model can handle a bifurcation and thus avoid an inaccurate prediction. The proposed model is formulated as a neural network including long short-term memories and is thus trained in an end-to-end manner. The proposed model was evaluated on the UNIPEN online handwritten character dataset, and the results show that the model can catch and deal with the bifurcation structures.

45.MicroAnalyzer: A Python Tool for Automated Bacterial Analysis with Fluorescence Microscopy ⬇️

Fluorescence microscopy is a widely used method among cell biologists for studying the localization and co-localization of fluorescent protein. For microbial cell biologists, these studies often include tedious and time-consuming manual segmentation of bacteria and of the fluorescence clusters or working with multiple programs. Here, we present MicroAnalyzer - a tool that automates these tasks by providing an end-to-end platform for microscope image analysis. While such tools do exist, they are costly, black-boxed programs. Microanalyzer offers an open-source alternative to these tools, allowing flexibility and expandability by advanced users. MicroAnalyzer provides accurate cell and fluorescence cluster segmentation based on state-of-the-art deep-learning segmentation models, combined with ad-hoc post-processing and Colicoords - an open-source cell image analysis tool for calculating general cell and fluorescence measurements. Using these methods, it performs better than generic approaches since the dynamic nature of neural networks allows for a quick adaptation to experiment restrictions and assumptions. Other existing tools do not consider experiment assumptions, nor do they provide fluorescence cluster detection without the need for any specialized equipment. The key goal of MicroAnalyzer is to automate the entire process of cell and fluorescence image analysis "from microscope to database", meaning it does not require any further input from the researcher except for the initial deep-learning model training. In this fashion, it allows the researchers to concentrate on the bigger picture instead of granular, eye-straining labor

46.I Like to Move It: 6D Pose Estimation as an Action Decision Process ⬇️

Object pose estimation is an integral part of robot vision and augmented reality. Robust and accurate pose prediction of both object rotation and translation is a crucial element to enable precise and safe human-machine interactions and to allow visualization in mixed reality. Previous 6D pose estimation methods treat the problem either as a regression task or discretize the pose space to classify. We reformulate the problem as an action decision process where an initial pose is updated in incremental discrete steps that sequentially move a virtual 3D rendering towards the correct solution. A neural network estimates likely moves from a single RGB image iteratively and determines so an acceptable final pose. In comparison to previous approaches that learn an object-specific pose embedding, a decision process allows for a lightweight architecture while it naturally generalizes to unseen objects. Moreover, the coherent action for process termination enables dynamic reduction of the computation cost if there are insignificant changes in a video sequence. While other methods only provide a static inference time, we can thereby automatically increase the runtime depending on the object motion. We evaluate robustness and accuracy of our action decision network on video scenes with known and unknown objects and show how this can improve the state-of-the-art on YCB videos significantly.

47.Enhancing a Neurocognitive Shared Visuomotor Model for Object Identification, Localization, and Grasping With Learning From Auxiliary Tasks ⬇️

We present a follow-up study on our unified visuomotor neural model for the robotic tasks of identifying, localizing, and grasping a target object in a scene with multiple objects. Our Retinanet-based model enables end-to-end training of visuomotor abilities in a biologically inspired developmental approach. In our initial implementation, a neural model was able to grasp selected objects from a planar surface. We embodied the model on the NICO humanoid robot. In this follow-up study, we expand the task and the model to reaching for objects in a three-dimensional space with a novel dataset based on augmented reality and a simulation environment. We evaluate the influence of training with auxiliary tasks, i.e., if learning of the primary visuomotor task is supported by learning to classify and locate different objects. We show that the proposed visuomotor model can learn to reach for objects in a three-dimensional space. We analyze the results for biologically-plausible biases based on object locations or properties. We show that the primary visuomotor task can be successfully trained simultaneously with one of the two auxiliary tasks. This is enabled by a complex neurocognitive model with shared and task-specific components, similar to models found in biological systems.

48.Multispectral Fusion for Object Detection with Cyclic Fuse-and-Refine Blocks ⬇️

Multispectral images (e.g. visible and infrared) may be particularly useful when detecting objects with the same model in different environments (e.g. day/night outdoor scenes). To effectively use the different spectra, the main technical problem resides in the information fusion process. In this paper, we propose a new halfway feature fusion method for neural networks that leverages the complementary/consistency balance existing in multispectral features by adding to the network architecture, a particular module that cyclically fuses and refines each spectral feature. We evaluate the effectiveness of our fusion method on two challenging multispectral datasets for object detection. Our results show that implementing our Cyclic Fuse-and-Refine module in any network improves the performance on both datasets compared to other state-of-the-art multispectral object detection methods.

49.Interactive White Balancing for Camera-Rendered Images ⬇️

White balance (WB) is one of the first photo-finishing steps used to render a captured image to its final output. WB is applied to remove the color cast caused by the scene's illumination. Interactive photo-editing software allows users to manually select different regions in a photo as examples of the illumination for WB correction (e.g., clicking on achromatic objects). Such interactive editing is possible only with images saved in a RAW image format. This is because RAW images have no photo-rendering operations applied and photo-editing software is able to apply WB and other photo-finishing procedures to render the final image. Interactively editing WB in camera-rendered images is significantly more challenging. This is because the camera hardware has already applied WB to the image and subsequent nonlinear photo-processing routines. These nonlinear rendering operations make it difficult to change the WB post-capture. The goal of this paper is to allow interactive WB manipulation of camera-rendered images. The proposed method is an extension of our recent work \cite{afifi2019color} that proposed a post-capture method for WB correction based on nonlinear color-mapping functions. Here, we introduce a new framework that links the nonlinear color-mapping functions directly to user-selected colors to enable {\it interactive} WB manipulation. This new framework is also more efficient in terms of memory and run-time (99% reduction in memory and 3$\times$ speed-up). Lastly, we describe how our framework can leverage a simple illumination estimation method (i.e., gray-world) to perform auto-WB correction that is on a par with the WB correction results in \cite{afifi2019color}. The source code is publicly available at this https URL.

50.Grasp Proposal Networks: An End-to-End Solution for Visual Learning of Robotic Grasps ⬇️

Learning robotic grasps from visual observations is a promising yet challenging task. Recent research shows its great potential by preparing and learning from large-scale synthetic datasets. For the popular, 6 degree-of-freedom (6-DOF) grasp setting of parallel-jaw gripper, most of existing methods take the strategy of heuristically sampling grasp candidates and then evaluating them using learned scoring functions. This strategy is limited in terms of the conflict between sampling efficiency and coverage of optimal grasps. To this end, we propose in this work a novel, end-to-end \emph{Grasp Proposal Network (GPNet)}, to predict a diverse set of 6-DOF grasps for an unseen object observed from a single and unknown camera view. GPNet builds on a key design of grasp proposal module that defines \emph{anchors of grasp centers} at discrete but regular 3D grid corners, which is flexible to support either more precise or more diverse grasp predictions. To test GPNet, we contribute a synthetic dataset of 6-DOF object grasps; evaluation is conducted using rule-based criteria, simulation test, and real test. Comparative results show the advantage of our methods over existing ones. Notably, GPNet gains better simulation results via the specified coverage, which helps achieve a ready translation in real test. We will make our dataset publicly available.

51.Few-shot Object Detection with Self-adaptive Attention Network for Remote Sensing Images ⬇️

In remote sensing field, there are many applications of object detection in recent years, which demands a great number of labeled data. However, we may be faced with some cases where only limited data are available. In this paper, we proposed a few-shot object detector which is designed for detecting novel objects provided with only a few examples. Particularly, in order to fit the object detection settings, our proposed few-shot detector concentrates on the relations that lie in the level of objects instead of the full image with the assistance of Self-Adaptive Attention Network (SAAN). The SAAN can fully leverage the object-level relations through a relation GRU unit and simultaneously attach attention on object features in a self-adaptive way according to the object-level relations to avoid some situations where the additional attention is useless or even detrimental. Eventually, the detection results are produced from the features that are added with attention and thus are able to be detected simply. The experiments demonstrate the effectiveness of the proposed method in few-shot scenes.

52.A Few-shot Learning Approach for Historical Ciphered Manuscript Recognition ⬇️

Encoded (or ciphered) manuscripts are a special type of historical documents that contain encrypted text. The automatic recognition of this kind of documents is challenging because: 1) the cipher alphabet changes from one document to another, 2) there is a lack of annotated corpus for training and 3) touching symbols make the symbol segmentation difficult and complex. To overcome these difficulties, we propose a novel method for handwritten ciphers recognition based on few-shot object detection. Our method first detects all symbols of a given alphabet in a line image, and then a decoding step maps the symbol similarity scores to the final sequence of transcribed symbols. By training on synthetic data, we show that the proposed architecture is able to recognize handwritten ciphers with unseen alphabets. In addition, if few labeled pages with the same alphabet are used for fine tuning, our method surpasses existing unsupervised and supervised HTR methods for ciphers recognition.

53.DT-Net: A novel network based on multi-directional integrated convolution and threshold convolution ⬇️

Since medical image data sets contain few samples and singular features, lesions are viewed as highly similar to other tissues. The traditional neural network has a limited ability to learn features. Even if a host of feature maps is expanded to obtain more semantic information, the accuracy of segmenting the final medical image is slightly improved, and the features are excessively redundant. To solve the above problems, in this paper, we propose a novel end-to-end semantic segmentation algorithm, DT-Net, and use two new convolution strategies to better achieve end-to-end semantic segmentation of medical images. 1. In the feature mining and feature fusion stage, we construct a multi-directional integrated convolution (MDIC). The core idea is to use the multi-scale convolution to enhance the local multi-directional feature maps to generate enhanced feature maps and to mine the generated features that contain more semantics without increasing the number of feature maps. 2. We also aim to further excavate and retain more meaningful deep features reduce a host of noise features in the training process. Therefore, we propose a convolution thresholding strategy. The central idea is to set a threshold to eliminate a large number of redundant features and reduce computational complexity. Through the two strategies proposed above, the algorithm proposed in this paper produces state-of-the-art results on two public medical image datasets. We prove in detail that our proposed strategy plays an important role in feature mining and eliminating redundant features. Compared with the existing semantic segmentation algorithms, our proposed algorithm has better robustness.

54.Affinity Space Adaptation for Semantic Segmentation Across Domains ⬇️

Semantic segmentation with dense pixel-wise annotation has achieved excellent performance thanks to deep learning. However, the generalization of semantic segmentation in the wild remains challenging. In this paper, we address the problem of unsupervised domain adaptation (UDA) in semantic segmentation. Motivated by the fact that source and target domain have invariant semantic structures, we propose to exploit such invariance across domains by leveraging co-occurring patterns between pairwise pixels in the output of structured semantic segmentation. This is different from most existing approaches that attempt to adapt domains based on individual pixel-wise information in image, feature, or output level. Specifically, we perform domain adaptation on the affinity relationship between adjacent pixels termed affinity space of source and target domain. To this end, we develop two affinity space adaptation strategies: affinity space cleaning and adversarial affinity space alignment. Extensive experiments demonstrate that the proposed method achieves superior performance against some state-of-the-art methods on several challenging benchmarks for semantic segmentation across domains. The code is available at this https URL.

55.Causal Intervention for Weakly-Supervised Semantic Segmentation ⬇️

We present a causal inference framework to improve Weakly-Supervised Semantic Segmentation (WSSS). Specifically, we aim to generate better pixel-level pseudo-masks by using only image-level labels -- the most crucial step in WSSS. We attribute the cause of the ambiguous boundaries of pseudo-masks to the confounding context, e.g., the correct image-level classification of "horse" and "person" may be not only due to the recognition of each instance, but also their co-occurrence context, making the model inspection (e.g., CAM) hard to distinguish between the boundaries. Inspired by this, we propose a structural causal model to analyze the causalities among images, contexts, and class labels. Based on it, we develop a new method: Context Adjustment (CONTA), to remove the confounding bias in image-level classification and thus provide better pseudo-masks as ground-truth for the subsequent segmentation model. On PASCAL VOC 2012 and MS-COCO, we show that CONTA boosts various popular WSSS methods to new state-of-the-arts.

56.A light-weight method to foster the (Grad)CAM interpretability and explainability of classification networks ⬇️

We consider a light-weight method which allows to improve the explainability of localized classification networks. The method considers (Grad)CAM maps during the training process by modification of the training loss and does not require additional structural elements. It is demonstrated that the (Grad)CAM interpretability, as measured by several indicators, can be improved in this way. Since the method shall be applicable on embedded systems and on standard deeper architectures, it essentially takes advantage of second order derivatives during the training and does not require additional model layers.

57.Neural Twins Talk ⬇️

Inspired by how the human brain employs more neural pathways when increasing the focus on a subject, we introduce a novel twin cascaded attention model that outperforms a state-of-the-art image captioning model that was originally implemented using one channel of attention for the visual grounding task. Visual grounding ensures the existence of words in the caption sentence that are grounded into a particular region in the input image. After a deep learning model is trained on visual grounding task, the model employs the learned patterns regarding the visual grounding and the order of objects in the caption sentences, when generating captions. We report the results of our experiments in three image captioning tasks on the COCO dataset. The results are reported using standard image captioning metrics to show the improvements achieved by our model over the previous image captioning model. The results gathered from our experiments suggest that employing more parallel attention pathways in a deep neural network leads to higher performance. Our implementation of NTT is publicly available at: this https URL.

58.Dense-View GEIs Set: View Space Covering for Gait Recognition based on Dense-View GAN ⬇️

Gait recognition has proven to be effective for long-distance human recognition. But view variance of gait features would change human appearance greatly and reduce its performance. Most existing gait datasets usually collect data with a dozen different angles, or even more few. Limited view angles would prevent learning better view invariant feature. It can further improve robustness of gait recognition if we collect data with various angles at 1 degree interval. But it is time consuming and labor consuming to collect this kind of dataset. In this paper, we, therefore, introduce a Dense-View GEIs Set (DV-GEIs) to deal with the challenge of limited view angles. This set can cover the whole view space, view angle from 0 degree to 180 degree with 1 degree interval. In addition, Dense-View GAN (DV-GAN) is proposed to synthesize this dense view set. DV-GAN consists of Generator, Discriminator and Monitor, where Monitor is designed to preserve human identification and view information. The proposed method is evaluated on the CASIA-B and OU-ISIR dataset. The experimental results show that DV-GEIs synthesized by DV-GAN is an effective way to learn better view invariant feature. We believe the idea of dense view generated samples will further improve the development of gait recognition.

59.Dictionary Learning with Low-rank Coding Coefficients for Tensor Completion ⬇️

In this paper, we propose a novel tensor learning and coding model for third-order data completion. Our model is to learn a data-adaptive dictionary from the given observations, and determine the coding coefficients of third-order tensor tubes. In the completion process, we minimize the low-rankness of each tensor slice containing the coding coefficients. By comparison with the traditional pre-defined transform basis, the advantages of the proposed model are that (i) the dictionary can be learned based on the given data observations so that the basis can be more adaptively and accurately constructed, and (ii) the low-rankness of the coding coefficients can allow the linear combination of dictionary features more effectively. Also we develop a multi-block proximal alternating minimization algorithm for solving such tensor learning and coding model, and show that the sequence generated by the algorithm can globally converge to a critical point. Extensive experimental results for real data sets such as videos, hyperspectral images, and traffic data are reported to demonstrate these advantages and show the performance of the proposed tensor learning and coding method is significantly better than the other tensor completion methods in terms of several evaluation metrics.

60.SIA-GCN: A Spatial Information Aware Graph Neural Network with 2D Convolutions for Hand Pose Estimation ⬇️

Graph Neural Networks (GNNs) generalize neural networks from applications on regular structures to applications on arbitrary graphs, and have shown success in many application domains such as computer vision, social networks and chemistry. In this paper, we extend GNNs along two directions: a) allowing features at each node to be represented by 2D spatial confidence maps instead of 1D vectors; and b) proposing an efficient operation to integrate information from neighboring nodes through 2D convolutions with different learnable kernels at each edge. The proposed SIA-GCN can efficiently extract spatial information from 2D maps at each node and propagate them through graph convolution. By associating each edge with a designated convolution kernel, the SIA-GCN could capture different spatial relationships for different pairs of neighboring nodes. We demonstrate the utility of SIA-GCN on the task of estimating hand keypoints from single-frame images, where the nodes represent the 2D coordinate heatmaps of keypoints and the edges denote the kinetic relationships between keypoints. Experiments on multiple datasets show that SIA-GCN provides a flexible and yet powerful framework to account for structural constraints between keypoints, and can achieve state-of-the-art performance on the task of hand pose estimation.

61.Online Learnable Keyframe Extraction in Videos and its Application with Semantic Word Vector in Action Recognition ⬇️

Video processing has become a popular research direction in computer vision due to its various applications such as video summarization, action recognition, etc. Recently, deep learning-based methods have achieved impressive results in action recognition. However, these methods need to process a full video sequence to recognize the action, even though most of these frames are similar and non-essential to recognizing a particular action. Additionally, these non-essential frames increase the computational cost and can confuse a method in action recognition. Instead, the important frames called keyframes not only are helpful in the recognition of an action but also can reduce the processing time of each video sequence for classification or in other applications, e.g. summarization. As well, current methods in video processing have not yet been demonstrated in an online fashion.
Motivated by the above, we propose an online learnable module for keyframe extraction. This module can be used to select key-shots in video and thus can be applied to video summarization. The extracted keyframes can be used as input to any deep learning-based classification model to recognize action. We also propose a plugin module to use the semantic word vector as input along with keyframes and a novel train/test strategy for the classification models. To our best knowledge, this is the first time such an online module and train/test strategy have been proposed.
The experimental results on many commonly used datasets in video summarization and in action recognition have shown impressive results using the proposed module.

62.Towards General Purpose and Geometry Preserving Single-View Depth Estimation ⬇️

Single-view depth estimation plays a crucial role in scene understanding for AR applications and 3D modelling as it allows to retrieve the geometry of a scene. However, it is only possible if the inverse depth estimates are unbiased, i.e. they are either absolute or Up-to-Scale (UTS). In recent years, great progress has been made in general-purpose single-view depth estimation. Nevertheless, the latest general-purpose models were trained using ranking or on Up-to-Shift-Scale (UTSS) data. As a result, they provide UTSS predictions that cannot be used to reconstruct scene geometry. In this work, we strive to build a general-purpose single-view UTS depth estimation model. Following Ranftl et. al., we train our model on a mixture of datasets and test it on several previously unseen datasets. We show that our method outperforms previous state-of-the-art UTS models. We train several light-weight models following the proposed training scheme and prove that our ideas are applicable for computationally efficient depth estimation.

63.Graph Adversarial Networks: Protecting Information against Adversarial Attacks ⬇️

We explore the problem of protecting information when learning with graph-structured data. While the advent of Graph Neural Networks (GNNs) has greatly improved node and graph representational learning in many applications, the neighborhood aggregation paradigm exposes additional vulnerabilities to attackers seeking to extract node-level information about sensitive attributes. To counter this, we propose a minimax game between the desired GNN encoder and the worst-case attacker. The resulting adversarial training creates a strong defense against inference attacks, while only suffering small loss in task performance. We analyze the effectiveness of our framework against a worst-case adversary, and characterize the trade-off between predictive accuracy and adversarial defense. Experiments across multiple datasets from recommender systems, knowledge graphs and quantum chemistry demonstrate that the proposed approach provides a robust defense across various graph structures and tasks, while producing competitive GNN encoders.

64.EIS -- a family of activation functions combining Exponential, ISRU, and Softplus ⬇️

Activation functions play a pivotal role in the function learning using neural networks. The non-linearity in the learned function is achieved by repeated use of the activation function. Over the years, numerous activation functions have been proposed to improve accuracy in several tasks. Basic functions like ReLU, Exponential, Tanh, or Softplus have been favorite among the deep learning community because of their simplicity. In recent years, several novel activation functions arising from these basic functions have been proposed, which have improved accuracy in some challenging datasets with complicated models. We propose a five hyper-parameters family of activation functions, namely EIS, defined as, [ \frac{x(\ln(1+e^x))^\alpha}{\sqrt{\beta+\gamma x^2}+\delta e^{-\theta x}}. ] We show examples of activation functions from the EIS family which outperform widely used activation functions on some well known datasets and models. For example, $\frac{x\ln(1+e^x)}{x+1.16e^{-x}}$ beats ReLU by 0.89% in DenseNet-169, 0.24% in Inception V3 in CIFAR100 dataset while 1.13% in Inception V3, 0.13% in DenseNet-169, 0.94% in SimpleNet model in CIFAR10 dataset. Also, $\frac{x\ln(1+e^x)}{\sqrt{1+x^2}}$ beats ReLU by 1.68% in DenseNet-169, 0.30% in Inception V3 in CIFAR100 dataset while 1.0% in Inception V3, 0.15% in DenseNet-169, 1.13% in SimpleNet model in CIFAR10 dataset

65.The Smart Parking Management System ⬇️

With growing, Car parking increases with the number of car users. With the increased use of smartphones and their applications, users prefer mobile phone-based solutions. This paper proposes the Smart Parking Management System (SPMS) that depends on Arduino parts, Android applications, and based on IoT. This gave the client the ability to check available parking spaces and reserve a parking spot. IR sensors are utilized to know if a car park space is allowed. Its area data are transmitted using the WI-FI module to the server and are recovered by the mobile application which offers many options attractively and with no cost to users and lets the user check reservation details. With IoT technology, the smart parking system can be connected wirelessly to easily track available locations.

66.Group Whitening: Balancing Learning Efficiency and Representational Capacity ⬇️

Batch normalization (BN) is an important technique commonly incorporated into deep learning models to perform standardization within mini-batches. The merits of BN in improving model's learning efficiency can be further amplified by applying whitening, while its drawbacks in estimating population statistics for inference can be avoided through group normalization (GN). This paper proposes group whitening (GW), which elaborately exploits the advantages of the whitening operation and avoids the disadvantages of normalization within mini-batches. Specifically, GW divides the neurons of a sample into groups for standardization, like GN, and then further decorrelates the groups. In addition, we quantitatively analyze the constraint imposed by normalization, and show how the batch size (group number) affects the performance of batch (group) normalized networks, from the perspective of model's representational capacity. This analysis provides theoretical guidance for applying GW in practice. Finally, we apply the proposed GW to ResNet and ResNeXt architectures and conduct experiments on the ImageNet and COCO benchmarks. Results show that GW consistently improves the performance of different architectures, with absolute gains of $1.02%$ $\sim$ $1.49%$ in top-1 accuracy on ImageNet and $1.82%$ $\sim$ $3.21%$ in bounding box AP on COCO.

67.AI Progress in Skin Lesion Analysis ⬇️

We examine progress in the use of AI for detecting skin lesions, with particular emphasis on the erythema migrans rash of acute Lyme disease, and other lesions, such as those from conditions like herpes zoster (shingles), tinea corporis, erythema multiforme, cellulitis, insect bites, or tick bites. We discuss important challenges for these applications, including the problem of AI bias especially regarding the lack of skin images in dark skinned individuals, being able to accurately detect, delineate, and segment lesions or regions of interest compared to normal skin in images, and the challenge of low shot learning (addressing classification with a paucity of training images). Solving these problems ranges from being highly desirable requirements, e.g. for delineation, which may be useful to disambiguate between similar types of lesions, and perform improved diagnostics, or required, as is the case for AI de-biasing, to allow for the deployment of fair AI techniques in the clinic for skin lesion analysis. For the problem of low shot learning in particular, we report skin analysis algorithms that gracefully degrade and still perform well at low shots, when compared to baseline algorithms: when using a little as 10 training exemplars per class, the baseline DL algorithm performance significantly degrades, with accuracy of 56.41%, close to chance, whereas the best performing low shot algorithm yields an accuracy of 83.33%.

68.High-throughput molecular imaging via deep learning enabled Raman spectroscopy ⬇️

Raman spectroscopy enables non-destructive, label-free imaging with unprecedented molecular contrast but is limited by slow data acquisition, largely preventing high-throughput imaging applications. Here, we present a comprehensive framework for higher-throughput molecular imaging via deep learning enabled Raman spectroscopy, termed DeepeR, trained on a large dataset of hyperspectral Raman images, with over 1.5 million spectra (400 hours of acquisition) in total. We firstly perform denoising and reconstruction of low signal-to-noise ratio Raman molecular signatures via deep learning, with a 9x improvement in mean squared error over state-of-the-art Raman filtering methods. Next, we develop a neural network for robust 2-4x super-resolution of hyperspectral Raman images that preserves molecular cellular information. Combining these approaches, we achieve Raman imaging speed-ups of up to 160x, enabling high resolution, high signal-to-noise ratio cellular imaging in under one minute. Finally, transfer learning is applied to extend DeepeR from cell to tissue-scale imaging. DeepeR provides a foundation that will enable a host of higher-throughput Raman spectroscopy and molecular imaging applications across biomedicine.

69.Characterization of Covid-19 Dataset using Complex Networks and Image Processing ⬇️

This paper aims to explore the structure of pattern behind covid-19 dataset. The dataset includes medical images with positive and negative cases. A sample of 100 sample is chosen, 50 per each class. An histogram frequency is calculated to get features using statistical measurements, besides a feature extraction using Grey Level Co-Occurrence Matrix (GLCM). Using both features are build Complex Networks respectively to analyze the adjacency matrices and check the presence of patterns. Initial experiments introduces the evidence of hidden patterns in the dataset for each class, which are visible using Complex Networks representation.

70.Scalable Transfer Learning with Expert Models ⬇️

Transfer of pre-trained representations can improve sample efficiency and reduce computational requirements for new tasks. However, representations used for transfer are usually generic, and are not tailored to a particular distribution of downstream tasks. We explore the use of expert representations for transfer with a simple, yet effective, strategy. We train a diverse set of experts by exploiting existing label structures, and use cheap-to-compute performance proxies to select the relevant expert for each target task. This strategy scales the process of transferring to new tasks, since it does not revisit the pre-training data during transfer. Accordingly, it requires little extra compute per target task, and results in a speed-up of 2-3 orders of magnitude compared to competing approaches. Further, we provide an adapter-based architecture able to compress many experts into a single model. We evaluate our approach on two different data sources and demonstrate that it outperforms baselines on over 20 diverse vision tasks in both cases.

71.Deep EvoGraphNet Architecture For Time-Dependent Brain Graph Data Synthesis From a Single Timepoint ⬇️

Learning how to predict the brain connectome (i.e. graph) development and aging is of paramount importance for charting the future of within-disorder and cross-disorder landscape of brain dysconnectivity evolution. Indeed, predicting the longitudinal (i.e., time-dependent ) brain dysconnectivity as it emerges and evolves over time from a single timepoint can help design personalized treatments for disordered patients in a very early stage. Despite its significance, evolution models of the brain graph are largely overlooked in the literature. Here, we propose EvoGraphNet, the first end-to-end geometric deep learning-powered graph-generative adversarial network (gGAN) for predicting time-dependent brain graph evolution from a single timepoint. Our EvoGraphNet architecture cascades a set of time-dependent gGANs, where each gGAN communicates its predicted brain graphs at a particular timepoint to train the next gGAN in the cascade at follow-up timepoint. Therefore, we obtain each next predicted timepoint by setting the output of each generator as the input of its successor which enables us to predict a given number of timepoints using only one single timepoint in an end- to-end fashion. At each timepoint, to better align the distribution of the predicted brain graphs with that of the ground-truth graphs, we further integrate an auxiliary Kullback-Leibler divergence loss function. To capture time-dependency between two consecutive observations, we impose an l1 loss to minimize the sparse distance between two serialized brain graphs. A series of benchmarks against variants and ablated versions of our EvoGraphNet showed that we can achieve the lowest brain graph evolution prediction error using a single baseline timepoint. Our EvoGraphNet code is available at this http URL.

72.Automated Pancreas Segmentation Using Multi-institutional Collaborative Deep Learning ⬇️

The performance of deep learning-based methods strongly relies on the number of datasets used for training. Many efforts have been made to increase the data in the medical image analysis field. However, unlike photography images, it is hard to generate centralized databases to collect medical images because of numerous technical, legal, and privacy issues. In this work, we study the use of federated learning between two institutions in a real-world setting to collaboratively train a model without sharing the raw data across national boundaries. We quantitatively compare the segmentation models obtained with federated learning and local training alone. Our experimental results show that federated learning models have higher generalizability than standalone training.

73.Amodal 3D Reconstruction for Robotic Manipulation via Stability and Connectivity ⬇️

Learning-based 3D object reconstruction enables single- or few-shot estimation of 3D object models. For robotics, this holds the potential to allow model-based methods to rapidly adapt to novel objects and scenes. Existing 3D reconstruction techniques optimize for visual reconstruction fidelity, typically measured by chamfer distance or voxel IOU. We find that when applied to realistic, cluttered robotics environments, these systems produce reconstructions with low physical realism, resulting in poor task performance when used for model-based control. We propose ARM, an amodal 3D reconstruction system that introduces (1) a stability prior over object shapes, (2) a connectivity prior, and (3) a multi-channel input representation that allows for reasoning over relationships between groups of objects. By using these priors over the physical properties of objects, our system improves reconstruction quality not just by standard visual metrics, but also performance of model-based control on a variety of robotics manipulation tasks in challenging, cluttered environments. Code is available at this http URL.

74.Medical Image Segmentation Using Deep Learning: A Survey ⬇️

Deep learning has been widely used for medical image segmentation and a large number of papers has been presented recording the success of deep learning in the field. In this paper, we present a comprehensive thematic survey on medical image segmentation using deep learning techniques. This paper makes two original contributions. Firstly, compared to traditional surveys that directly divide literatures of deep learning on medical image segmentation into many groups and introduce literatures in detail for each group, we classify currently popular literatures according to a multi-level structure from coarse to fine. Secondly, this paper focuses on supervised and weakly supervised learning approaches, without including unsupervised approaches since they have been introduced in many old surveys and they are not popular currently. For supervised learning approaches, we analyze literatures in three aspects: the selection of backbone networks, the design of network blocks, and the improvement of loss functions. For weakly supervised learning approaches, we investigate literature according to data augmentation, transfer learning, and interactive segmentation, separately. Compared to existing surveys, this survey classifies the literatures very differently from before and is more convenient for readers to understand the relevant rationale and will guide them to think of appropriate improvements in medical image segmentation based on deep learning approaches.

75.Cloud Removal for Remote Sensing Imagery via Spatial Attention Generative Adversarial Network ⬇️

Optical remote sensing imagery has been widely used in many fields due to its high resolution and stable geometric properties. However, remote sensing imagery is inevitably affected by climate, especially clouds. Removing the cloud in the high-resolution remote sensing satellite image is an indispensable pre-processing step before analyzing it. For the sake of large-scale training data, neural networks have been successful in many image processing tasks, but the use of neural networks to remove cloud in remote sensing imagery is still relatively small. We adopt generative adversarial network to solve this task and introduce the spatial attention mechanism into the remote sensing imagery cloud removal task, proposes a model named spatial attention generative adversarial network (SpA GAN), which imitates the human visual mechanism, and recognizes and focuses the cloud area with local-to-global spatial attention, thereby enhancing the information recovery of these areas and generating cloudless images with better quality...

76.Visual Steering for One-Shot Deep Neural Network Synthesis ⬇️

Recent advancements in the area of deep learning have shown the effectiveness of very large neural networks in several applications. However, as these deep neural networks continue to grow in size, it becomes more and more difficult to configure their many parameters to obtain good results. Presently, analysts must experiment with many different configurations and parameter settings, which is labor-intensive and time-consuming. On the other hand, the capacity of fully automated techniques for neural network architecture search is limited without the domain knowledge of human experts. To deal with the problem, we formulate the task of neural network architecture optimization as a graph space exploration, based on the one-shot architecture search technique. In this approach, a super-graph of all candidate architectures is trained in one-shot and the optimal neural network is identified as a sub-graph. In this paper, we present a framework that allows analysts to effectively build the solution sub-graph space and guide the network search by injecting their domain knowledge. Starting with the network architecture space composed of basic neural network components, analysts are empowered to effectively select the most promising components via our one-shot search scheme. Applying this technique in an iterative manner allows analysts to converge to the best performing neural network architecture for a given application. During the exploration, analysts can use their domain knowledge aided by cues provided from a scatterplot visualization of the search space to edit different components and guide the search for faster convergence. We designed our interface in collaboration with several deep learning researchers and its final effectiveness is evaluated with a user study and two case studies.

77.Interventional Few-Shot Learning ⬇️

We uncover an ever-overlooked deficiency in the prevailing Few-Shot Learning (FSL) methods: the pre-trained knowledge is indeed a confounder that limits the performance. This finding is rooted from our causal assumption: a Structural Causal Model (SCM) for the causalities among the pre-trained knowledge, sample features, and labels. Thanks to it, we propose a novel FSL paradigm: Interventional Few-Shot Learning (IFSL). Specifically, we develop three effective IFSL algorithmic implementations based on the backdoor adjustment, which is essentially a causal intervention towards the SCM of many-shot learning: the upper-bound of FSL in a causal view. It is worth noting that the contribution of IFSL is orthogonal to existing fine-tuning and meta-learning based FSL methods, hence IFSL can improve all of them, achieving a new 1-/5-shot state-of-the-art on \textit{mini}ImageNet, \textit{tiered}ImageNet, and cross-domain CUB. Code is released at this https URL.

78.Sparse-data based 3D surface reconstruction with vector matching ⬇️

Three dimensional surface reconstruction based on two dimensional sparse information in the form of only a small number of level lines of the surface with moderately complex structures, containing both structured and unstructured geometries, is considered in this paper. A new model has been proposed which is based on the idea of using normal vector matching combined with a first order and a second order total variation regularizers. A fast algorithm based on the augmented Lagrangian is also proposed. Numerical experiments are provided showing the effectiveness of the model and the algorithm in reconstructing surfaces with detailed features and complex structures for both synthetic and real world digital maps.

79.VATLD: A Visual Analytics System to Assess, Understand and Improve Traffic Light Detection ⬇️

Traffic light detection is crucial for environment perception and decision-making in autonomous driving. State-of-the-art detectors are built upon deep Convolutional Neural Networks (CNNs) and have exhibited promising performance. However, one looming concern with CNN based detectors is how to thoroughly evaluate the performance of accuracy and robustness before they can be deployed to autonomous vehicles. In this work, we propose a visual analytics system, VATLD, equipped with a disentangled representation learning and semantic adversarial learning, to assess, understand, and improve the accuracy and robustness of traffic light detectors in autonomous driving applications. The disentangled representation learning extracts data semantics to augment human cognition with human-friendly visual summarization, and the semantic adversarial learning efficiently exposes interpretable robustness risks and enables minimal human interaction for actionable insights. We also demonstrate the effectiveness of various performance improvement strategies derived from actionable insights with our visual analytics system, VATLD, and illustrate some practical implications for safety-critical applications in autonomous driving.

80.Classification and understanding of cloud structures via satellite images with EfficientUNet ⬇️

Climate change has been a common interest and the forefront of crucial political discussion and decision-making for many years. Shallow clouds play a significant role in understanding the Earth's climate, but they are challenging to interpret and represent in a climate model. By classifying these cloud structures, there is a better possibility of understanding the physical structures of the clouds, which would improve the climate model generation, resulting in a better prediction of climate change or forecasting weather update. Clouds organise in many forms, which makes it challenging to build traditional rule-based algorithms to separate cloud features. In this paper, classification of cloud organization patterns was performed using a new scaled-up version of Convolutional Neural Network (CNN) named as EfficientNet as the encoder and UNet as decoder where they worked as feature extractor and reconstructor of fine grained feature map and was used as a classifier, which will help experts to understand how clouds will shape the future climate. By using a segmentation model in a classification task, it was shown that with a good encoder alongside UNet, it is possible to obtain good performance from this dataset. Dice coefficient has been used for the final evaluation metric, which gave the score of 66.26% and 66.02% for public and private leaderboard on Kaggle competition respectively.

81.Learning to Improve Image Compression without Changing the Standard Decoder ⬇️

In recent years we have witnessed an increasing interest in applying Deep Neural Networks (DNNs) to improve the rate-distortion performance in image compression. However, the existing approaches either train a post-processing DNN on the decoder side, or propose learning for image compression in an end-to-end manner. This way, the trained DNNs are required in the decoder, leading to the incompatibility to the standard image decoders (\eg, JPEG) in personal computers and mobiles. Therefore, we propose learning to improve the encoding performance with the standard decoder. In this paper, We work on JPEG as an example. Specifically, a frequency-domain pre-editing method is proposed to optimize the distribution of DCT coefficients, aiming at facilitating the JPEG compression. Moreover, we propose learning the JPEG quantization table jointly with the pre-editing network. Most importantly, we do not modify the JPEG decoder and therefore our approach is applicable when viewing images with the widely used standard JPEG decoder. The experiments validate that our approach successfully improves the rate-distortion performance of JPEG in terms of various quality metrics, such as PSNR, MS-SSIM and LPIPS. Visually, this translates to better overall color retention especially when strong compression is applied.

82.Interaction-Based Trajectory Prediction Over a Hybrid Traffic Graph ⬇️

Behavior prediction of traffic actors is an essential component of any real-world self-driving system. Actors' long-term behaviors tend to be governed by their interactions with other actors or traffic elements (traffic lights, stop signs) in the scene. To capture this highly complex structure of interactions, we propose to use a hybrid graph whose nodes represent both the traffic actors as well as the static and dynamic traffic elements present in the scene. The different modes of temporal interaction (e.g., stopping and going) among actors and traffic elements are explicitly modeled by graph edges. This explicit reasoning about discrete interaction types not only helps in predicting future motion, but also enhances the interpretability of the model, which is important for safety-critical applications such as autonomous driving. We predict actors' trajectories and interaction types using a graph neural network, which is trained in a semi-supervised manner. We show that our proposed model, TrafficGraphNet, achieves state-of-the-art trajectory prediction accuracy while maintaining a high level of interpretability.

83.ESTAN: Enhanced Small Tumor-Aware Network for Breast Ultrasound Image Segmentation ⬇️

Breast tumor segmentation is a critical task in computer-aided diagnosis (CAD) systems for breast cancer detection because accurate tumor size, shape and location are important for further tumor quantification and classification. However, segmenting small tumors in ultrasound images is challenging, due to the speckle noise, varying tumor shapes and sizes among patients, and the existence of tumor-like image regions. Recently, deep learning-based approaches have achieved great success for biomedical image analysis, but current state-of-the-art approaches achieve poor performance for segmenting small breast tumors. In this paper, we propose a novel deep neural network architecture, namely Enhanced Small Tumor-Aware Network (ESTAN), to accurately and robustly segment breast tumors. ESTAN introduces two encoders to extract and fuse image context information at different scales and utilizes row-column-wise kernels in the encoder to adapt to breast anatomy. We validate the proposed approach and compare it to nine state-of-the-art approaches on three public breast ultrasound datasets using seven quantitative metrics. The results demonstrate that the proposed approach achieves the best overall performance and outperforms all other approaches on small tumor segmentation.

84.Learning Self-Expression Metrics for Scalable and Inductive Subspace Clustering ⬇️

Subspace clustering has established itself as a state-of-the-art approach to clustering high-dimensional data. In particular, methods relying on the self-expressiveness property have recently proved especially successful. However, they suffer from two major shortcomings: First, a quadratic-size coefficient matrix is learned directly, preventing these methods from scaling beyond small datasets. Secondly, the trained models are transductive and thus cannot be used to cluster out-of-sample data unseen during training. Instead of learning self-expression coefficients directly, we propose a novel metric learning approach to learn instead a subspace affinity function using a siamese neural network architecture. Consequently, our model benefits from a constant number of parameters and a constant-size memory footprint, allowing it to scale to considerably larger datasets. In addition, we can formally show that out model is still able to exactly recover subspace clusters given an independence assumption. The siamese architecture in combination with a novel geometric classifier further makes our model inductive, allowing it to cluster out-of-sample data. Additionally, non-linear clusters can be detected by simply adding an auto-encoder module to the architecture. The whole model can then be trained end-to-end in a self-supervised manner. This work in progress reports promising preliminary results on the MNIST dataset. In the spirit of reproducible research, me make all code publicly available. In future work we plan to investigate several extensions of our model and to expand experimental evaluation.

85.Normalization Techniques in Training DNNs: Methodology, Analysis and Application ⬇️

Normalization techniques are essential for accelerating the training and improving the generalization of deep neural networks (DNNs), and have successfully been used in various applications. This paper reviews and comments on the past, present and future of normalization methods in the context of DNN training. We provide a unified picture of the main motivation behind different approaches from the perspective of optimization, and present a taxonomy for understanding the similarities and differences between them. Specifically, we decompose the pipeline of the most representative normalizing activation methods into three components: the normalization area partitioning, normalization operation and normalization representation recovery. In doing so, we provide insight for designing new normalization technique. Finally, we discuss the current progress in understanding normalization methods, and provide a comprehensive review of the applications of normalization for particular tasks, in which it can effectively solve the key issues.

86.Agile Reactive Navigation for A Non-Holonomic Mobile Robot Using A Pixel Processor Array ⬇️

This paper presents an agile reactive navigation strategy for driving a non-holonomic ground vehicle around a preset course of gates in a cluttered environment using a low-cost processor array sensor. This enables machine vision tasks to be performed directly upon the sensor's image plane, rather than using a separate general-purpose computer. We demonstrate a small ground vehicle running through or avoiding multiple gates at high speed using minimal computational resources. To achieve this, target tracking algorithms are developed for the Pixel Processing Array and captured images are then processed directly on the vision sensor acquiring target information for controlling the ground vehicle. The algorithm can run at up to 2000 fps outdoors and 200fps at indoor illumination levels. Conducting image processing at the sensor level avoids the bottleneck of image transfer encountered in conventional sensors. The real-time performance of on-board image processing and robustness is validated through experiments. Experimental results demonstrate that the algorithm's ability to enable a ground vehicle to navigate at an average speed of 2.20 m/s for passing through multiple gates and 3.88 m/s for a 'slalom' task in an environment featuring significant visual clutter.

87.Hierarchical Deep Multi-modal Network for Medical Visual Question Answering ⬇️

Visual Question Answering in Medical domain (VQA-Med) plays an important role in providing medical assistance to the end-users. These users are expected to raise either a straightforward question with a Yes/No answer or a challenging question that requires a detailed and descriptive answer. The existing techniques in VQA-Med fail to distinguish between the different question types sometimes complicates the simpler problems, or over-simplifies the complicated ones. It is certainly true that for different question types, several distinct systems can lead to confusion and discomfort for the end-users. To address this issue, we propose a hierarchical deep multi-modal network that analyzes and classifies end-user questions/queries and then incorporates a query-specific approach for answer prediction. We refer our proposed approach as Hierarchical Question Segregation based Visual Question Answering, in short HQS-VQA. Our contributions are three-fold, viz. firstly, we propose a question segregation (QS) technique for VQAMed; secondly, we integrate the QS model to the hierarchical deep multi-modal neural network to generate proper answers to the queries related to medical images; and thirdly, we study the impact of QS in Medical-VQA by comparing the performance of the proposed model with QS and a model without QS. We evaluate the performance of our proposed model on two benchmark datasets, viz. RAD and CLEF18. Experimental results show that our proposed HQS-VQA technique outperforms the baseline models with significant margins. We also conduct a detailed quantitative and qualitative analysis of the obtained results and discover potential causes of errors and their solutions.

88.Iterative Reconstruction for Low-Dose CT using Deep Gradient Priors of Generative Model ⬇️

Dose reduction in computed tomography (CT) is essential for decreasing radiation risk in clinical applications. Iterative reconstruction is one of the most promising ways to compensate for the increased noise due to reduction of photon flux. Rather than most existing prior-driven algorithms that benefit from manually designed prior functions or supervised learning schemes, in this work we integrate the data-consistency as a conditional term into the iterative generative model for low-dose CT. At first, a score-based generative network is used for unsupervised distribution learning and the gradient of generative density prior is learned from normal-dose images. Then, the annealing Langevin dynamics is employed to update the trained priors with conditional scheme, i.e., the distance between the reconstructed image and the manifold is minimized along with data fidelity during reconstruction. Experimental comparisons demonstrated the noise reduction and detail preservation abilities of the proposed method.

89.COVID-19 Infection Map Generation and Detection from Chest X-Ray Images ⬇️

Computer-aided diagnosis has become a necessity for accurate and immediate coronavirus disease 2019 (COVID-19) detection to aid treatment and prevent the spread of the virus. Compared to other diagnosis methodologies, chest X-ray (CXR) imaging is an advantageous tool since it is fast, low-cost, and easily accessible. Thus, CXR has a great potential not only to help diagnose COVID-19 but also to track the progression of the disease. Numerous studies have proposed to use Deep Learning techniques for COVID-19 diagnosis. However, they have used very limited CXR image repositories for evaluation with a small number, a few hundreds, of COVID-19 samples. Moreover, these methods can neither localize nor grade the severity of COVID-19 infection. For this purpose, recent studies proposed to explore the activation maps of deep networks. However, they remain inaccurate for localizing the actual infestation making them unreliable for clinical use. This study proposes a novel method for the joint localization, severity grading, and detection of COVID-19 from CXR images by generating the so-called infection maps that can accurately localize and grade the severity of COVID-19 infection. To accomplish this, we have compiled the largest COVID-19 dataset up to date with 2951 COVID-19 CXR images, where the annotation of the ground-truth segmentation masks is performed on CXRs by a novel collaborative expert human-machine approach. Furthermore, we publicly release the first CXR dataset with the ground-truth segmentation masks of the COVID-19 infected regions. A detailed set of experiments show that state-of-the-art segmentation networks can learn to localize COVID-19 infection with an F1-score of 85.81%, that is significantly superior to the activation maps created by the previous methods. Finally, the proposed approach achieved a COVID-19 detection performance with 98.37% sensitivity and 99.16% specificity.

90.Quantitative and Qualitative Evaluation of Explainable Deep Learning Methods for Ophthalmic Diagnosis ⬇️

Background: The lack of explanations for the decisions made by algorithms such as deep learning has hampered their acceptance by the clinical community despite highly accurate results on multiple problems. Recently, attribution methods have emerged for explaining deep learning models, and they have been tested on medical imaging problems. The performance of attribution methods is compared on standard machine learning datasets and not on medical images. In this study, we perform a comparative analysis to determine the most suitable explainability method for retinal OCT diagnosis.
Methods: A commonly used deep learning model known as Inception v3 was trained to diagnose 3 retinal diseases - choroidal neovascularization (CNV), diabetic macular edema (DME), and drusen. The explanations from 13 different attribution methods were rated by a panel of 14 clinicians for clinical significance. Feedback was obtained from the clinicians regarding the current and future scope of such methods.
Results: An attribution method based on a Taylor series expansion, called Deep Taylor was rated the highest by clinicians with a median rating of 3.85/5. It was followed by two other attribution methods, Guided backpropagation and SHAP (SHapley Additive exPlanations).
Conclusion: Explanations of deep learning models can make them more transparent for clinical diagnosis. This study compared different explanations methods in the context of retinal OCT diagnosis and found that the best performing method may not be the one considered best for other deep learning tasks. Overall, there was a high degree of acceptance from the clinicians surveyed in the study.
Keywords: explainable AI, deep learning, machine learning, image processing, Optical coherence tomography, retina, Diabetic macular edema, Choroidal Neovascularization, Drusen

91.Deep Learning-based Four-region Lung Segmentation in Chest Radiography for COVID-19 Diagnosis ⬇️

Purpose. Imaging plays an important role in assessing severity of COVID 19 pneumonia. However, semantic interpretation of chest radiography (CXR) findings does not include quantitative description of radiographic opacities. Most current AI assisted CXR image analysis framework do not quantify for regional variations of disease. To address these, we proposed a four region lung segmentation method to assist accurate quantification of COVID 19 pneumonia. Methods. A segmentation model to separate left and right lung is firstly applied, and then a carina and left hilum detection network is used, which are the clinical landmarks to separate the upper and lower lungs. To improve the segmentation performance of COVID 19 images, ensemble strategy incorporating five models is exploited. Using each region, we evaluated the clinical relevance of the proposed method with the Radiographic Assessment of the Quality of Lung Edema (RALE). Results. The proposed ensemble strategy showed dice score of 0.900, which is significantly higher than conventional methods (0.854 0.889). Mean intensities of segmented four regions indicate positive correlation to the extent and density scores of pulmonary opacities under the RALE framework. Conclusion. A deep learning based model in CXR can accurately segment and quantify regional distribution of pulmonary opacities in patients with COVID 19 pneumonia.

92.Potential Features of ICU Admission in X-ray Images of COVID-19 Patients ⬇️

X-ray images may present non-trivial features with predictive information of patients that develop severe symptoms of COVID-19. If true, this hypothesis may have practical value in allocating resources to particular patients while using a relatively inexpensive imaging technique. The difficulty of testing such a hypothesis comes from the need for large sets of labelled data, which not only need to be well-annotated but also should contemplate the post-imaging severity outcome. On this account, this paper presents a methodology for extracting features from a limited data set with outcome label (patient required ICU admission or not) and correlating its significance to an additional, larger data set with hundreds of images. The methodology employs a neural network trained to recognise lung pathologies to extract the semantic features, which are then analysed with a shallow decision tree to limit overfitting while increasing interpretability. This analysis points out that only a few features explain most of the variance between patients that developed severe symptoms. When applied to an unrelated, larger data set with labels extracted from clinical notes, the method classified distinct sets of samples where there was a much higher frequency of labels such as Consolidation', Effusion', and alveolar'. A further brief analysis on the locations of such labels also showed a significant increase in the frequency of words like bilateral', middle', and lower' in patients classified as with higher chances of going severe. The methodology for dealing with the lack of specific ICU label data while attesting correlations with a data set containing text notes is novel; its results suggest that some pathologies should receive higher weights when assessing disease severity.

93.Physics-Guided Recurrent Graph Networks for Predicting Flow and Temperature in River Networks ⬇️

This paper proposes a physics-guided machine learning approach that combines advanced machine learning models and physics-based models to improve the prediction of water flow and temperature in river networks. We first build a recurrent graph network model to capture the interactions among multiple segments in the river network. Then we present a pre-training technique which transfers knowledge from physics-based models to initialize the machine learning model and learn the physics of streamflow and thermodynamics. We also propose a new loss function that balances the performance over different river segments. We demonstrate the effectiveness of the proposed method in predicting temperature and streamflow in a subset of the Delaware River Basin. In particular, we show that the proposed method brings a 33%/14% improvement over the state-of-the-art physics-based model and 24%/14% over traditional machine learning models (e.g., Long-Short Term Memory Neural Network) in temperature/streamflow prediction using very sparse (0.1%) observation data for training. The proposed method has also been shown to produce better performance when generalized to different seasons or river segments with different streamflow ranges.

94.Deep Selective Combinatorial Embedding and Consistency Regularization for Light Field Super-resolution ⬇️

Light field (LF) images acquired by hand-held devices usually suffer from low spatial resolution as the limited detector resolution has to be shared with the angular dimension. LF spatial super-resolution (SR) thus becomes an indispensable part of the LF camera processing pipeline. The high-dimensionality characteristic and complex geometrical structure of LF images make the problem more challenging than traditional single-image SR. The performance of existing methods is still limited as they fail to thoroughly explore the coherence among LF sub-aperture images (SAIs) and are insufficient in accurately preserving the scene's parallax structure. To tackle this challenge, we propose a novel learning-based LF spatial SR framework. Specifically, each SAI of an LF image is first coarsely and individually super-resolved by exploring the complementary information among SAIs with selective combinatorial geometry embedding. To achieve efficient and effective selection of the complementary information, we propose two novel sub-modules conducted hierarchically: the patch selector provides an option of retrieving similar image patches based on offline disparity estimation to handle large-disparity correlations; and the SAI selector adaptively and flexibly selects the most informative SAIs to improve the embedding efficiency. To preserve the parallax structure among the reconstructed SAIs, we subsequently append a consistency regularization network trained over a structure-aware loss function to refine the parallax relationships over the coarse estimation. In addition, we extend the proposed method to irregular LF data. To the best of our knowledge, this is the first learning-based SR method for irregular LF data. Experimental results over both synthetic and real-world LF datasets demonstrate the significant advantage of our approach over state-of-the-art methods.

95.Unsupervised Model Adaptation for Continual Semantic Segmentation ⬇️

We develop an algorithm for adapting a semantic segmentation model that is trained using a labeled source domain to generalize well in an unlabeled target domain. A similar problem has been studied extensively in the unsupervised domain adaptation (UDA) literature, but existing UDA algorithms require access to both the source domain labeled data and the target domain unlabeled data for training a domain agnostic semantic segmentation model. Relaxing this constraint enables a user to adapt pretrained models to generalize in a target domain, without requiring access to source data. To this end, we learn a prototypical distribution for the source domain in an intermediate embedding space. This distribution encodes the abstract knowledge that is learned from the source domain. We then use this distribution for aligning the target domain distribution with the source domain distribution in the embedding space. We provide theoretical analysis and explain conditions under which our algorithm is effective. Experiments on benchmark adaptation task demonstrate our method achieves competitive performance even compared with joint UDA approaches.

96.Generating Realistic COVID19 X-rays with a Mean Teacher + Transfer Learning GAN ⬇️

COVID-19 is a novel infectious disease responsible for over 800K deaths worldwide as of August 2020. The need for rapid testing is a high priority and alternative testing strategies including X-ray image classification are a promising area of research. However, at present, public datasets for COVID19 x-ray images have low data volumes, making it challenging to develop accurate image classifiers. Several recent papers have made use of Generative Adversarial Networks (GANs) in order to increase the training data volumes. But realistic synthetic COVID19 X-rays remain challenging to generate. We present a novel Mean Teacher + Transfer GAN (MTT-GAN) that generates COVID19 chest X-ray images of high quality. In order to create a more accurate GAN, we employ transfer learning from the Kaggle Pneumonia X-Ray dataset, a highly relevant data source orders of magnitude larger than public COVID19 datasets. Furthermore, we employ the Mean Teacher algorithm as a constraint to improve stability of training. Our qualitative analysis shows that the MTT-GAN generates X-ray images that are greatly superior to a baseline GAN and visually comparable to real X-rays. Although board-certified radiologists can distinguish MTT-GAN fakes from real COVID19 X-rays. Quantitative analysis shows that MTT-GAN greatly improves the accuracy of both a binary COVID19 classifier as well as a multi-class Pneumonia classifier as compared to a baseline GAN. Our classification accuracy is favourable as compared to recently reported results in the literature for similar binary and multi-class COVID19 screening tasks.

97.Enhanced 3D Myocardial Strain Estimation from Multi-View 2D CMR Imaging ⬇️

In this paper, we propose an enhanced 3D myocardial strain estimation procedure which combines complementary displacement information from multiple orientations of a single imaging modality (untagged CMR SSFP images). To estimate myocardial strain across the left ventricle, we register the sets of short-axis, four-chamber and twochamber views via a 2D non-rigid registration algorithm implemented in a commercial software (Segment, Medviso). We then create a series of interpolating functions for the three orthogonal directions of motion and use them to deform a tetrahedral mesh representation of a patient-specific left ventricle. Additionally, we correct for overestimation of displacement by introducing a weighting scheme that is based on displacement along the long axis. The procedure was evaluated on the STACOM 2011 dataset containing CMR SSFP images for 16 healthy volunteers. We show increased accuracy in estimating the three strain components (radial, circumferential, longitudinal) compared to reported results in the challenge, for the imaging modality of interest (SSFP). Our peak strain estimates are also significantly closer to reported measurements from studies of a larger cohort in the literature. Our proposed procedure provides a fast way to accurately reconstruct a deforming patient-specific model of the left ventricle using the commonest imaging modality routinely administered in clinical settings, without requiring additional or specialized imaging protocols.

98.Blind Image Super-Resolution with Spatial Context Hallucination ⬇️

Deep convolution neural networks (CNNs) play a critical role in single image super-resolution (SISR) since the amazing improvement of high performance computing. However, most of the super-resolution (SR) methods only focus on recovering bicubic degradation. Reconstructing high-resolution (HR) images from randomly blurred and noisy low-resolution (LR) images is still a challenging problem. In this paper, we propose a novel Spatial Context Hallucination Network (SCHN) for blind super-resolution without knowing the degradation kernel. We find that when the blur kernel is unknown, separate deblurring and super-resolution could limit the performance because of the accumulation of error. Thus, we integrate denoising, deblurring and super-resolution within one framework to avoid such a problem. We train our model on two high quality datasets, DIV2K and Flickr2K. Our method performs better than state-of-the-art methods when input images are corrupted with random blur and noise.

99.Democratizing Artificial Intelligence in Healthcare: A Study of Model Development Across Two Institutions Incorporating Transfer Learning ⬇️

The training of deep learning models typically requires extensive data, which are not readily available as large well-curated medical-image datasets for development of artificial intelligence (AI) models applied in Radiology. Recognizing the potential for transfer learning (TL) to allow a fully trained model from one institution to be fine-tuned by another institution using a much small local dataset, this report describes the challenges, methodology, and benefits of TL within the context of developing an AI model for a basic use-case, segmentation of Left Ventricular Myocardium (LVM) on images from 4-dimensional coronary computed tomography angiography. Ultimately, our results from comparisons of LVM segmentation predicted by a model locally trained using random initialization, versus one training-enhanced by TL, showed that a use-case model initiated by TL can be developed with sparse labels with acceptable performance. This process reduces the time required to build a new model in the clinical environment at a different institution.

100.Deep Artifact-Free Residual Network for Single Image Super-Resolution ⬇️

Recently, convolutional neural networks have shown promising performance for single-image super-resolution. In this paper, we propose Deep Artifact-Free Residual (DAFR) network which uses the merits of both residual learning and usage of ground-truth image as target. Our framework uses a deep model to extract the high-frequency information which is necessary for high-quality image reconstruction. We use a skip-connection to feed the low-resolution image to the network before the image reconstruction. In this way, we are able to use the ground-truth images as target and avoid misleading the network due to artifacts in difference image. In order to extract clean high-frequency information, we train the network in two steps. The first step is a traditional residual learning which uses the difference image as target. Then, the trained parameters of this step are transferred to the main training in the second step. Our experimental results show that the proposed method achieves better quantitative and qualitative image quality compared to the existing methods.

101.SceneGen: Generative Contextual Scene Augmentation using Scene Graph Priors ⬇️

Spatial computing experiences are constrained by the real-world surroundings of the user. In such experiences, augmenting virtual objects to existing scenes require a contextual approach, where geometrical conflicts are avoided, and functional and plausible relationships to other objects are maintained in the target environment. Yet, due to the complexity and diversity of user environments, automatically calculating ideal positions of virtual content that is adaptive to the context of the scene is considered a challenging task. Motivated by this problem, in this paper we introduce SceneGen, a generative contextual augmentation framework that predicts virtual object positions and orientations within existing scenes. SceneGen takes a semantically segmented scene as input, and outputs positional and orientational probability maps for placing virtual content. We formulate a novel spatial Scene Graph representation, which encapsulates explicit topological properties between objects, object groups, and the room. We believe providing explicit and intuitive features plays an important role in informative content creation and user interaction of spatial computing settings, a quality that is not captured in implicit models. We use kernel density estimation (KDE) to build a multivariate conditional knowledge model trained using prior spatial Scene Graphs extracted from real-world 3D scanned data. To further capture orientational properties, we develop a fast pose annotation tool to extend current real-world datasets with orientational labels. Finally, to demonstrate our system in action, we develop an Augmented Reality application, in which objects can be contextually augmented in real-time.