ArXiv cs.CV -- Wed, 18 Nov 2020

1.Deep Active Surface Models ⬇️

Active Surface Models have a long history of being useful to model complex 3D surfaces, but only Active Contours have been used in conjunction with deep networks, and then only to produce the data term as well as meta-parameter maps controlling them. In this paper, we advocate a much tighter integration. We introduce layers implementing Active Surface Models that can be integrated seamlessly into Graph Convolutional Networks to enforce sophisticated smoothness priors at an acceptable computational cost. We show that the resulting Deep Active Surface Models outperform equivalent architectures that use traditional regularization loss terms to impose smoothness priors for 3D surface reconstruction from 2D images and for 3D volume segmentation.

2.Learning Canonical Transformations ⬇️

Humans understand a set of canonical geometric transformations (such as translation and rotation) that support generalization by being untethered to any specific object. We explore inductive biases that help a neural network model learn these transformations in pixel space in a way that can generalize out-of-domain. Specifically, we find that high training set diversity is sufficient for the extrapolation of translation to unseen shapes and scales, and that an iterative training scheme achieves significant extrapolation of rotation in time.

3.Spatio-Temporal Analysis of Facial Actions using Lifecycle-Aware Capsule Networks ⬇️

Most state-of-the-art approaches for Facial Action Unit (AU) detection rely upon evaluating facial expressions from static frames, encoding a snapshot of heightened facial activity. In real-world interactions, however, facial expressions are usually more subtle and evolve in a temporal manner, requiring AU detection models to learn spatial as well as temporal information. In this paper, we focus on both spatial and spatio-temporal features encoding the temporal evolution of facial AU activation. For this purpose, we propose the Action Unit Lifecycle-Aware Capsule Network (AULA-Caps) that performs AU detection using both frame- and sequence-level features. At the frame level, the capsule layers of AULA-Caps learn spatial feature primitives to determine AU activations; at the sequence level, they learn temporal dependencies between contiguous frames by focusing on relevant spatio-temporal segments in the sequence. The learnt feature capsules are routed together such that the model learns to selectively focus more on spatial or spatio-temporal information depending upon the AU lifecycle. The proposed model is evaluated on the commonly used BP4D and GFT benchmark datasets, obtaining state-of-the-art results on both.

4.Facial Expressions as a Vulnerability in Face Recognition ⬇️

This work explores facial expression bias as a security vulnerability of face recognition systems. Face recognition technology has experienced great advances during the last decades. However, despite the great performance achieved by state-of-the-art face recognition systems, the algorithms are still sensitive to a large range of covariates. This work presents a comprehensive analysis of how facial expression bias impacts the performance of face recognition technologies. Our study analyzes: i) facial expression biases in the most popular face recognition databases; and ii) the impact of facial expression on face recognition performance. Our experimental framework includes four face detectors, three face recognition models, and four different databases. Our results demonstrate a huge facial expression bias in the most widely used databases, as well as a related impact of facial expression on the performance of state-of-the-art algorithms. This work opens the door to new research lines focused on mitigating the observed vulnerability.

5.P1AC: Revisiting Absolute Pose From a Single Affine Correspondence ⬇️

We introduce a novel solution to the problem of estimating the pose of a calibrated camera given a single observation of an oriented point and an affine correspondence to a reference image. Affine correspondences have traditionally been used to improve feature matching over wide baselines; however, little previous work has considered the use of such correspondences for absolute camera pose computation. The advantage of our approach (P1AC) is that it requires only a single correspondence in the minimal case in comparison to the traditional point-based approach (P3P) which requires at least three points. Our method removes the limiting assumptions made in previous work and provides a general solution that is applicable to large-scale image-based localization. Our evaluation on synthetic data shows that our approach is numerically stable and more robust to point observation noise than P3P. We also evaluate the application of our approach for large-scale image-based localization and demonstrate a practical reduction in the number of iterations and computation time required to robustly localize an image.

6.PaDiM: a Patch Distribution Modeling Framework for Anomaly Detection and Localization ⬇️

We present a new framework for Patch Distribution Modeling, PaDiM, to concurrently detect and localize anomalies in images in a one-class learning setting. PaDiM makes use of a pretrained convolutional neural network (CNN) for patch embedding, and of multivariate Gaussian distributions to obtain a probabilistic representation of the normal class. It also exploits correlations between the different semantic levels of the CNN to better localize anomalies. PaDiM outperforms current state-of-the-art approaches for both anomaly detection and localization on the MVTec AD and STC datasets. To match real-world visual industrial inspection, we extend the evaluation protocol to assess the performance of anomaly localization algorithms on a non-aligned dataset. The state-of-the-art performance and low complexity of PaDiM make it a good candidate for many industrial applications.
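
A minimal sketch of the PaDiM-style scoring step, assuming precomputed patch embeddings; the array shapes and the regularization constant are illustrative, not the paper's exact values. One multivariate Gaussian is fit per patch position over normal training images, and test patches are scored by Mahalanobis distance.

```python
# Hedged sketch of PaDiM-style patch scoring (not the authors' code).
import numpy as np

def fit_patch_gaussians(train_emb):
    """train_emb: (N, P, D) embeddings of N normal images at P patch positions."""
    mean = train_emb.mean(axis=0)                           # (P, D)
    P, D = train_emb.shape[1], train_emb.shape[2]
    cov_inv = np.empty((P, D, D))
    for p in range(P):
        centered = train_emb[:, p] - mean[p]
        cov = centered.T @ centered / (len(train_emb) - 1)
        cov += 0.01 * np.eye(D)                             # regularize for stability
        cov_inv[p] = np.linalg.inv(cov)
    return mean, cov_inv

def anomaly_map(test_emb, mean, cov_inv):
    """test_emb: (P, D) -> per-patch Mahalanobis distances, higher = more anomalous."""
    diff = test_emb - mean                                  # (P, D)
    return np.sqrt(np.einsum('pd,pde,pe->p', diff, cov_inv, diff))
```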

7.A Method to Generate High Precision Mesh Model and RGB-D Dataset for 6D Pose Estimation Task ⬇️

Recently, 3D vision has improved greatly due to the development of deep neural networks. A high-quality dataset is important to deep learning methods. Existing datasets for 3D vision, such as Bigbird and YCB, have been constructed. However, the depth sensors used to make these datasets are out of date, so the resolution and accuracy of the datasets cannot fulfill the higher standards of current demand. Although the equipment and technology have improved, no one has tried to collect a new and better dataset. Here we try to fill that gap. To this end, we propose a new method for object reconstruction that takes into account speed, accuracy and robustness. Our method can be used to produce large datasets with better and more accurate annotation. More importantly, our data is closer to rendered data, further shrinking the gap between real and synthetic data.

8.Anatomy Prior Based U-net for Pathology Segmentation with Attention ⬇️

Pathological area segmentation in cardiac magnetic resonance (MR) images plays a vital role in the clinical diagnosis of cardiovascular diseases. Because of their irregular shape and small area, pathological regions have always been challenging to segment. We propose an anatomy prior based framework, which combines the U-net segmentation network with the attention technique. Leveraging the fact that the pathology is contained within the myocardium, we propose a neighborhood penalty strategy to gauge the inclusion relationship between the myocardium and the myocardial infarction and no-reflow areas. This neighborhood penalty strategy can be applied to any two labels with an inclusive relationship (such as the whole infarction and the myocardium) to form a neighboring loss. The proposed framework is evaluated on the EMIDEC dataset. Results show that our framework is effective in pathological area segmentation.
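
As a rough illustration of how such a neighboring loss could look, the sketch below penalizes predicted pathology probability mass that falls outside a softly dilated myocardium prediction. The function name, the max-pooling dilation, and the tensor shapes are our assumptions, not the paper's formulation.

```python
# Hedged sketch of an inclusion-style neighboring loss (our interpretation).
import torch
import torch.nn.functional as F

def neighboring_loss(patho_prob, myo_prob, dilate=3):
    """patho_prob, myo_prob: (B, 1, H, W) sigmoid outputs of the segmentation net."""
    # Soft dilation of the myocardium map via max-pooling gives a small tolerance band.
    myo_dilated = F.max_pool2d(myo_prob, kernel_size=dilate, stride=1,
                               padding=dilate // 2)
    # Pathology probability outside the (dilated) myocardium violates inclusion.
    outside = patho_prob * (1.0 - myo_dilated)
    return outside.mean()
```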

9.Recognition and standardization of cardiac MRI orientation via multi-tasking learning and deep neural networks ⬇️

In this paper, we study the problem of imaging orientation in cardiac MRI, and propose a framework to categorize the orientation for recognition and standardization via deep neural networks. The method uses a new multi-tasking strategy, where both the tasks of cardiac segmentation and orientation recognition are achieved simultaneously. For multiple sequences and modalities of MRI, we propose a transfer learning strategy, which adapts our proposed model from a single modality to multiple modalities. We embed the orientation recognition network in a Cardiac MRI Orientation Adjust Tool, i.e., CMRadjustNet. We implemented two versions of CMRadjustNet, including a user-interface (UI) software and a command-line tool. The former supports MRI image visualization, orientation prediction, adjustment, and storage operations; the latter enables batch operations. The source code, neural network models and tools have been released and are openly available via this https URL.

10.Multi-frame Feature Aggregation for Real-time Instrument Segmentation in Endoscopic Video ⬇️

Deep learning-based methods have achieved promising results on surgical instrument segmentation. However, the high computation cost may limit the applications of deep models to time-sensitive tasks such as online surgical video analysis for robotic-assisted surgery. Also, current performance may still suffer from challenging conditions in surgical images such as various lighting conditions and the presence of blood. We propose a novel Multi-frame Feature Aggregation (MFFA) module that leverages information of neighboring frames for segmentation while reducing the influence of spatial misalignment between frames. The MFFA module also further aggregates features spatially based on the spatial self-attention mechanism. Neighboring frames usually have similar appearances, so we consider feature aggregation over a frame sequence as an iterative feature aggregation procedure. By distributing the computational workload of deep feature extraction over each frame in a sequence, we can use a lightweight encoder to reduce the computation costs. Moreover, public surgical videos usually are not labeled by frame, so we develop a method that can randomly synthesize a surgical frame sequence from a labeled frame to assist network training. We demonstrate that our approach achieves superior performance to corresponding deeper segmentation models on a public endoscopic sinus surgery dataset.

11.Global Road Damage Detection: State-of-the-art Solutions ⬇️

This paper summarizes the Global Road Damage Detection Challenge (GRDDC), a Big Data Cup organized as a part of the IEEE International Conference on Big Data 2020. The Big Data Cup challenges involve a released dataset and a well-defined problem with clear evaluation metrics. The challenges run on a data competition platform that maintains a leaderboard for the participants. In the presented case, the data constitute 26,336 road images collected from India, Japan, and the Czech Republic, used to propose methods for automatically detecting road damage in these countries. In total, 121 teams from several countries registered for this competition. The submitted solutions were evaluated using two datasets, test1 and test2, comprising 2,631 and 2,664 images. This paper encapsulates the top 12 solutions proposed by these teams. The best performing model utilizes YOLO-based ensemble learning to yield an F1 score of 0.67 on test1 and 0.66 on test2. The paper concludes with a review of the facets that worked well for the presented challenge and those that could be improved in future challenges.

12.RAIST: Learning Risk Aware Traffic Interactions via Spatio-Temporal Graph Convolutional Networks ⬇️

A key aspect of driving a road vehicle is to interact with the other road users, assess their intentions and make risk-aware tactical decisions. An intuitive approach of enabling an intelligent automated driving system would be to incorporate some aspects of the human driving behavior. To this end, we propose a novel driving framework for egocentric views, which is based on spatio-temporal traffic graphs. The traffic graphs not only model the spatial interactions amongst the road users, but also their individual intentions through temporally associated message passing. We leverage spatio-temporal graph convolutional network (ST-GCN) to train the graph edges. These edges are formulated using parameterized functions of 3D positions and scene-aware appearance features of road agents. Along with tactical behavior prediction, it is crucial to evaluate the risk assessing ability of the proposed framework. We claim that our framework learns risk aware representations by improving on the task of risk object identification, especially in identifying objects with vulnerable interactions like pedestrians and cyclists.

13.Uncertainty Modelling in Deep Neural Networks for Image Data ⬇️

Quantifying uncertainty in a model's predictions is important as it enables, for example, the safety of an AI system to be increased by acting on the model's output in an informed manner. We cannot expect a system to be 100% accurate or perfect at its task; however, we can equip the system with tools to inform us when it is not certain about a prediction. This way, a second check can be performed, or the task can be passed to a human specialist. This is crucial for applications where the cost of an error is high, such as autonomous vehicle control, medical image analysis, financial estimation or legal fields. Deep Neural Networks are powerful black-box predictors that have recently achieved impressive performance on a wide spectrum of tasks. Quantifying predictive uncertainty in DNNs is a challenging and ongoing problem. Although there have been many efforts to equip NNs with tools to estimate uncertainty, such as Monte Carlo Dropout, most previous methods focus on only one of the three types of uncertainty: model, data, or distributional. In this paper we propose a complete framework to capture and quantify all three types of uncertainty in DNNs for image classification. This framework includes an ensemble of CNNs for model uncertainty, a supervised reconstruction auto-encoder to capture distributional uncertainty, and the output of the activation functions in the last layer of the network to capture data uncertainty. Finally, we demonstrate the efficiency of our method on popular image datasets for classification.
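
A compact sketch of the three signals under one simplified decomposition: predictive entropy split into mutual information (model uncertainty) and expected entropy (data uncertainty), plus an auto-encoder reconstruction error for distributional uncertainty. This is a common formulation, not necessarily the paper's exact one.

```python
# Hedged numpy sketch of a three-way uncertainty decomposition.
import numpy as np

def entropy(p, axis=-1, eps=1e-12):
    return -np.sum(p * np.log(p + eps), axis=axis)

def uncertainties(ensemble_probs, recon_error):
    """ensemble_probs: (M, C) softmax outputs of M ensemble members for one image;
    recon_error: scalar reconstruction error from a supervised auto-encoder."""
    mean_p = ensemble_probs.mean(axis=0)
    total = entropy(mean_p)                    # predictive entropy of the ensemble mean
    data = entropy(ensemble_probs).mean()      # expected entropy ~ data uncertainty
    model = total - data                       # mutual information ~ model uncertainty
    distributional = recon_error               # out-of-distribution signal
    return model, data, distributional
```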

14.Pyramid Point: A Multi-Level Focusing Network for Revisiting Feature Layers ⬇️

We present a method to learn a diverse group of object categories from an unordered point set. We propose our Pyramid Point network, which uses a dense pyramid structure instead of the traditional 'U' shape, typically seen in semantic segmentation networks. This pyramid structure gives a second look, allowing the network to revisit different layers simultaneously, increasing the contextual information by creating additional layers with less noise. We introduce a Focused Kernel Point convolution (FKP Conv), which expands on the traditional point convolutions by adding an attention mechanism to the kernel outputs. This FKP Conv increases our feature quality and allows us to weigh the kernel outputs dynamically. These FKP Convs are the central part of our Recurrent FKP Bottleneck block, which makes up the backbone of our encoder. With this distinct network, we demonstrate competitive performance on three benchmark data sets. We also perform an ablation study to show the positive effects of each element in our FKP Conv.

15.ABC-Net: Semi-Supervised Multimodal GAN-based Engagement Detection using an Affective, Behavioral and Cognitive Model ⬇️

We present ABC-Net, a novel semi-supervised multimodal GAN framework to detect engagement levels in video conversations based on psychology literature. We use three constructs: behavioral, cognitive, and affective engagement, to extract various features that can effectively capture engagement levels. We feed these features to our semi-supervised GAN network that does regression using these latent representations to obtain the corresponding valence and arousal values, which are then categorized into different levels of engagements. We demonstrate the efficiency of our network through experiments on the RECOLA database. To evaluate our method, we analyze and compare our performance on RECOLA and report a relative performance improvement of more than 5% over the baseline methods. To the best of our knowledge, our approach is the first method to classify engagement based on a multimodal semi-supervised network.

16.SeekNet: Improved Human Instance Segmentation via Reinforcement Learning Based Optimized Robot Relocation ⬇️

Amodal recognition is the ability of the system to detect occluded objects. Most state-of-the-art Visual Recognition systems lack the ability to perform amodal recognition. Few studies have achieved amodal recognition through passive prediction or embodied recognition approaches. However, these approaches suffer from challenges in real-world applications, such as dynamic objects. We propose SeekNet, an improved optimization method for amodal recognition through embodied visual recognition. Additionally, we implement SeekNet for social robots, where there are multiple interactions with crowded humans. Hence, we focus on occluded human detection & tracking and showcase the superiority of our algorithm over other baselines. We also experiment with SeekNet to improve the confidence of COVID-19 symptoms pre-screening algorithms using our efficient embodied recognition system.

17.Non-Local Robust Quaternion Matrix Completion for Large-Scale Color Images and Videos Inpainting ⬇️

The image nonlocal self-similarity (NSS) prior refers to the fact that a local patch often has many similar nonlocal patches across the image. In this paper we apply this NSS prior to enhance the robust quaternion matrix completion (QMC) method and significantly improve its inpainting performance. A patch-group-based NSS prior learning scheme is proposed to learn explicit NSS models from natural color images. The NSS-based QMC algorithm computes an optimal low-rank approximation to the high-rank color image, resulting in high PSNR and SSIM measures and, in particular, better visual quality. A new joint NSS-based QMC method is also presented to solve the color video inpainting problem based on a quaternion tensor representation. Numerical experiments on large-scale color images and videos indicate the advantages of NSS-based QMC over state-of-the-art methods.

18.3D CNNs with Adaptive Temporal Feature Resolutions ⬇️

While state-of-the-art 3D Convolutional Neural Networks (CNN) achieve very good results on action recognition datasets, they are computationally very expensive and require many GFLOPs. While the GFLOPs of a 3D CNN can be decreased by reducing the temporal feature resolution within the network, there is no setting that is optimal for all input clips. In this work, we therefore introduce a differentiable Similarity Guided Sampling (SGS) module, which can be plugged into any existing 3D CNN architecture. SGS empowers 3D CNNs by learning the similarity of temporal features and grouping similar features together. As a result, the temporal feature resolution is no longer static but varies for each input video clip. By integrating SGS as an additional layer within current 3D CNNs, we can convert them into much more efficient 3D CNNs with adaptive temporal feature resolutions (ATFR). Our evaluations show that the proposed module improves the state-of-the-art by reducing the computational cost (GFLOPs) by half while preserving or even improving the accuracy. We evaluate our module by adding it to multiple state-of-the-art 3D CNNs on various datasets such as Kinetics-600, Kinetics-400, mini-Kinetics, Something-Something V2, UCF101, and HMDB51.
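
A toy sketch of similarity-guided grouping over temporal features conveys the adaptive-resolution idea, without the differentiable machinery of the actual SGS module; the cosine-similarity test and hard threshold are our simplifications.

```python
# Hedged sketch of grouping similar temporal features (not the SGS layer itself).
import numpy as np

def group_similar_frames(feats, thresh=0.9):
    """feats: (T, D) per-time-step features -> (T', D) averaged group features."""
    groups, current = [], [feats[0]]
    for t in range(1, len(feats)):
        cos = feats[t] @ current[-1] / (
            np.linalg.norm(feats[t]) * np.linalg.norm(current[-1]) + 1e-8)
        if cos >= thresh:
            current.append(feats[t])            # similar: extend the current group
        else:
            groups.append(np.mean(current, axis=0))
            current = [feats[t]]                # dissimilar: start a new group
    groups.append(np.mean(current, axis=0))
    return np.stack(groups)                     # T' <= T, adapted to the clip content
```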

19.A Review of Generalized Zero-Shot Learning Methods ⬇️

Generalized zero-shot learning (GZSL) aims to train a model for classifying data samples under the condition that some output classes are unknown during supervised learning. To address this challenging task, GZSL leverages semantic information of both seen (source) and unseen (target) classes to bridge the gap between them. Since its introduction, many GZSL models have been formulated. In this review paper, we present a comprehensive review of GZSL. Firstly, we provide an overview of GZSL including its problems and challenging issues. Then, we introduce a hierarchical categorization of GZSL methods and discuss the representative methods of each category. In addition, we discuss several research directions for future studies.

20.Exploring Self-Attention for Visual Odometry ⬇️

Visual odometry networks commonly use pretrained optical flow networks to derive the ego-motion between consecutive frames. The features extracted by these networks represent the motion of all the pixels between frames. However, due to the existence of dynamic objects and texture-less surfaces in the scene, the motion information for every image region might not be reliable for inferring odometry, since dynamic objects do not contribute to the derivation of incremental changes in position. Recent works in this area lack attention mechanisms in their structures to facilitate dynamic reweighting of the feature maps for extracting more refined ego-motion information. In this paper, we explore the effectiveness of self-attention in visual odometry. We report qualitative and quantitative results against the SOTA methods. Furthermore, saliency-based studies alongside specially designed experiments are utilized to investigate the effect of self-attention on VO. Our experiments show that using self-attention allows for the extraction of better features while achieving better odometry performance compared to networks that lack such structures.

21.Beyond Static Features for Temporally Consistent 3D Human Pose and Shape from a Video ⬇️

Despite the recent success of single image-based 3D human pose and shape estimation methods, recovering temporally consistent and smooth 3D human motion from a video is still challenging. Several video-based methods have been proposed; however, they fail to resolve the single image-based methods' temporal inconsistency issue due to a strong dependency on the static feature of the current frame. In this regard, we present a temporally consistent mesh recovery system (TCMR). It effectively focuses on the past and future frames' temporal information without being dominated by the current static feature. Our TCMR significantly outperforms previous video-based methods in temporal consistency with better per-frame 3D pose and shape accuracy. We will release the code.

22.Can Semantic Labels Assist Self-Supervised Visual Representation Learning? ⬇️

Recently, contrastive learning has largely advanced the progress of unsupervised visual representation learning. Pre-trained on ImageNet, some self-supervised algorithms reported higher transfer learning performance compared to fully-supervised methods, seeming to deliver the message that human labels hardly contribute to learning transferable visual features. In this paper, we defend the usefulness of semantic labels but point out that fully-supervised and self-supervised methods are pursuing different kinds of features. To alleviate this issue, we present a new algorithm named Supervised Contrastive Adjustment in Neighborhood (SCAN) that maximally prevents the semantic guidance from damaging the appearance feature embedding. In a series of downstream tasks, SCAN achieves superior performance compared to previous fully-supervised and self-supervised methods, and sometimes the gain is significant. More importantly, our study reveals that semantic labels are useful in assisting self-supervised methods, opening a new direction for the community.

23.Mutual Information Based Method for Unsupervised Disentanglement of Video Representation ⬇️

Video prediction is an interesting and challenging task of predicting future frames from a given set of context frames that belong to a video sequence. Video prediction models have found prospective applications in maneuver planning, health care, autonomous navigation and simulation. One of the major challenges in future frame generation is the high dimensional nature of visual data. In this work, we propose the Mutual Information Predictive Auto-Encoder (MIPAE) framework, which reduces the task of predicting high dimensional video frames by factorising video representations into content and low dimensional pose latent variables that are easy to predict. A standard LSTM network is used to predict these low dimensional pose representations. Content and the predicted pose representations are decoded to generate future frames. Our approach leverages the temporal structure of the latent generative factors of a video and a novel mutual information loss to learn disentangled video representations. We also propose a metric based on the mutual information gap (MIG) to quantitatively assess the effectiveness of disentanglement on the DSprites and MPI3D-real datasets. MIG scores corroborate the visual superiority of frames predicted by MIPAE. We also compare our method quantitatively on the evaluation metrics LPIPS, SSIM and PSNR.
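
As a reference point, here is a hedged sketch of the standard MIG computation on discretized latents; the paper's estimator and binning choices may differ from this one.

```python
# Hedged sketch of the Mutual Information Gap (MIG) metric (standard definition).
import numpy as np
from sklearn.metrics import mutual_info_score

def mig(latents, factors, bins=20):
    """latents: (N, L) continuous codes; factors: (N, K) discrete ground-truth factors."""
    # Discretize each latent dimension so the discrete MI estimator applies.
    disc = np.stack([np.digitize(z, np.histogram_bin_edges(z, bins)[:-1])
                     for z in latents.T])                     # (L, N)
    gaps = []
    for k in range(factors.shape[1]):
        v = factors[:, k]
        mi = np.array([mutual_info_score(v, z) for z in disc])
        h_v = mutual_info_score(v, v)                         # I(v; v) = H(v)
        top2 = np.sort(mi)[-2:]                               # two most informative latents
        gaps.append((top2[1] - top2[0]) / h_v)
    return float(np.mean(gaps))
```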

24.Multi Receptive Field Network for Semantic Segmentation ⬇️

Semantic segmentation is one of the key tasks in computer vision, which is to assign a category label to each pixel in an image. Despite significant progress achieved recently, most existing methods still suffer from two challenging issues: 1) the size of objects and stuff in an image can be very diverse, demanding the incorporation of multi-scale features into fully convolutional networks (FCNs); 2) the pixels close to or at the boundaries of object/stuff are hard to classify due to the intrinsic weakness of convolutional networks. To address the first issue, we propose a new Multi-Receptive Field Module (MRFM), explicitly taking multi-scale features into account. For the second issue, we design an edge-aware loss which is effective in distinguishing the boundaries of object/stuff. With these two designs, our Multi Receptive Field Network achieves new state-of-the-art results on two widely-used semantic segmentation benchmark datasets. Specifically, we achieve a mean IoU of 83.0 on the Cityscapes dataset and 88.4 on the Pascal VOC2012 dataset.

25.A Divide et Impera Approach for 3D Shape Reconstruction from Multiple Views ⬇️

Estimating the 3D shape of an object from a single or multiple images has gained popularity thanks to the recent breakthroughs powered by deep learning. Most approaches regress the full object shape in a canonical pose, possibly extrapolating the occluded parts based on the learned priors. However, their viewpoint invariant technique often discards the unique structures visible from the input images. In contrast, this paper proposes to rely on viewpoint variant reconstructions by merging the visible information from the given views. Our approach is divided into three steps. Starting from the sparse views of the object, we first align them into a common coordinate system by estimating the relative pose between all the pairs. Then, inspired by the traditional voxel carving, we generate an occupancy grid of the object taken from the silhouette on the images and their relative poses. Finally, we refine the initial reconstruction to build a clean 3D model which preserves the details from each viewpoint. To validate the proposed method, we perform a comprehensive evaluation on the ShapeNet reference benchmark in terms of relative pose estimation and 3D shape reconstruction.
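
The classical voxel-carving step that inspires the occupancy-grid generation can be sketched as follows (generic formulation with hypothetical inputs, not the authors' pipeline): a voxel stays occupied only if it projects inside the silhouette in every view.

```python
# Hedged sketch of classical silhouette-based voxel carving.
import numpy as np

def carve(voxels, silhouettes, projections):
    """voxels: (V, 3) world points; silhouettes: list of (H, W) binary masks;
    projections: list of (3, 4) camera projection matrices."""
    occupied = np.ones(len(voxels), dtype=bool)
    homo = np.hstack([voxels, np.ones((len(voxels), 1))])       # homogeneous coords (V, 4)
    for sil, P in zip(silhouettes, projections):
        uvw = homo @ P.T                                        # project into the view (V, 3)
        u = (uvw[:, 0] / uvw[:, 2]).round().astype(int)
        v = (uvw[:, 1] / uvw[:, 2]).round().astype(int)
        h, w = sil.shape
        inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)
        hit = np.zeros(len(voxels), dtype=bool)
        hit[inside] = sil[v[inside], u[inside]] > 0             # lands on the silhouette?
        occupied &= hit                                         # carve away misses
    return voxels[occupied]
```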

26.Slender Object Detection: Diagnoses and Improvements ⬇️

In this paper, we are concerned with the detection of a particular type of object with extreme aspect ratios, namely slender objects. In real-world scenarios as well as widely-used datasets (such as COCO), slender objects are actually very common. However, this type of object has been largely overlooked by previous object detection algorithms. Upon our investigation, for a classical object detection method, a drastic drop of 18.9% mAP on COCO is observed if it is solely evaluated on slender objects. Therefore, we systematically study the problem of slender object detection in this work. Accordingly, an analytical framework with carefully designed benchmark and evaluation protocols is established, in which different algorithms and modules can be inspected and compared. Our key findings include: 1) the essential role of anchors in label assignment; 2) the descriptive capability of the 2-point representation; 3) the crucial strategies for improving the detection of slender objects and regular objects. Our work identifies and extends the insights of existing methods that were previously underexploited. Furthermore, we propose a feature adaption strategy that achieves clear and consistent improvements over current representative object detection methods. In particular, a natural and effective extension of the center prior, which leads to a significant improvement on slender objects, is devised. We believe this work opens up new opportunities and calibrates ablation standards for future research in the field of object detection.

27.DeepSeqSLAM: A Trainable CNN+RNN for Joint Global Description and Sequence-based Place Recognition ⬇️

Sequence-based place recognition methods for all-weather navigation are well-known for producing state-of-the-art results under challenging day-night or summer-winter transitions. These systems, however, rely on complex handcrafted heuristics for sequential matching - which are applied on top of a pre-computed pairwise similarity matrix between reference and query image sequences of a single route - to further reduce false-positive rates compared to single-frame retrieval methods. As a result, performing multi-frame place recognition can be extremely slow for deployment on autonomous vehicles or evaluation on large datasets, and fail when using relatively short parameter values such as a sequence length of 2 frames. In this paper, we propose DeepSeqSLAM: a trainable CNN+RNN architecture for jointly learning visual and positional representations from a single monocular image sequence of a route. We demonstrate our approach on two large benchmark datasets, Nordland and Oxford RobotCar - recorded over 728 km and 10 km routes, respectively, each during 1 year with multiple seasons, weather, and lighting conditions. On Nordland, we compare our method to two state-of-the-art sequence-based methods across the entire route under summer-winter changes using a sequence length of 2 and show that our approach can get over 72% AUC compared to 27% AUC for Delta Descriptors and 2% AUC for SeqSLAM; while drastically reducing the deployment time from around 1 hour to 1 minute against both. The framework code and video are available at this https URL

28.Bridging the Performance Gap Between Pose Estimation Networks Trained on Real And Synthetic Data Using Domain Randomization ⬇️

Since the introduction of deep learning methods, pose estimation performance has increased drastically. Usually, large amounts of manually annotated training data are required for these networks to perform well. While training on synthetic data can avoid the manual annotation, it introduces another obstacle: there is currently a large performance gap between methods trained on real and synthetic data. This paper introduces a new method which bridges the gap between networks trained on real and synthetic data. As opposed to other methods, the network utilizes 3D point clouds. This allows both for domain randomization in 3D and for the use of neighboring geometric information during inference. Experiments on three large pose estimation benchmarks show that the presented method outperforms previous methods trained on synthetic data and achieves comparable, and sometimes superior, results to existing methods trained on real data.

29.ACSC: Automatic Calibration for Non-repetitive Scanning Solid-State LiDAR and Camera Systems ⬇️

Recently, the rapid development of Solid-State LiDAR (SSL) has enabled low-cost and efficient acquisition of 3D point clouds from the environment, which has inspired a large number of studies and applications. However, the non-uniformity of its scanning pattern and the inconsistency of its ranging error distribution bring challenges to its calibration. In this paper, we propose a fully automatic calibration method for non-repetitive scanning SSL and camera systems. First, a temporal-spatial-based geometric feature refinement method is presented to extract effective features from SSL point clouds; then, the 3D corners of the calibration target (a printed checkerboard) are estimated from the reflectance distribution of points. Based on the above, a target-based extrinsic calibration method is finally proposed. We evaluate the proposed method on different types of LiDAR and camera sensor combinations in real conditions, and achieve accurate and robust calibration results. The code is available at this https URL.

30.A Digital Image Processing Approach for Hepatic Diseases Staging based on the Glisson's Capsule ⬇️

Due to the need for quick and effective treatments for liver diseases, which are among the most common health problems in the world, staging fibrosis through non-invasive and economical methods has become of great importance. Taking inspiration from diagnostic laparoscopy, used in the past for hepatic diseases, in this paper ultrasound images of the liver are studied, focusing on a specific region of the organ where the Glisson's capsule is visible. In ultrasound images, the Glisson's capsule appears as a line which can be extracted via classical methods in the literature. By making use of a combination of standard image processing techniques and Convolutional Neural Network approaches, the scope of this work is to give evidence to the idea that great informative potential lies in the smoothness of the Glisson's capsule surface. To this purpose, several classifiers are taken into consideration, which deal with different types of data, namely ultrasound images, binary images depicting the Glisson's line, and feature vectors extracted from the original image. This is a preliminary study that has been conducted retrospectively, based on the results of elastosonography examinations.

31.Generalized Continual Zero-Shot Learning ⬇️

Recently, zero-shot learning (ZSL) has emerged as an exciting topic and attracted a lot of attention. ZSL aims to classify unseen classes by transferring knowledge from seen classes based on the class description. Despite showing promising performance, ZSL approaches assume that the training samples from all seen classes are available during training, which is practically not feasible. To address this issue, we propose a more generalized and practical setup for ZSL, i.e., continual ZSL (CZSL), where classes arrive sequentially in the form of tasks and the model actively learns from the changing environment by leveraging past experience. Further, to enhance reliability, we develop CZSL for a single-head continual learning setting where the task identity is revealed during training but not during testing. To avoid catastrophic forgetting and intransigence, we use knowledge distillation and store and replay a few samples from previous tasks in a small episodic memory. We develop baselines and evaluate generalized CZSL on five ZSL benchmark datasets for two different settings of continual learning: with and without class incremental learning. Moreover, CZSL is developed for two types of variational autoencoders, which generate two types of features for classification: (i) generated features at the output space and (ii) generated discriminative features at the latent space. The experimental results clearly indicate that single-head CZSL is more generalizable and suitable for practical applications.

32.Digging Deeper into CRNN Model in Chinese Text Images Recognition ⬇️

Automatic text image recognition is a prevalent application in the computer vision field. One efficient way is to use a Convolutional Recurrent Neural Network (CRNN) to accomplish the task in an end-to-end (End2End) fashion. However, CRNN notoriously fails on multi-row images and excel-like images. In this paper, we present an alternative that first recognizes single-row images, then extends the same architecture to recognize multi-row images with several proposed methods. To recognize excel-like images containing box lines, we propose the Line-Deep Denoising Convolutional AutoEncoder (Line-DDeCAE) to recover box lines. Finally, we present a Knowledge Distillation (KD) method to compress the original CRNN model without loss of generality. For our experiments, we first generate artificial samples from a Chinese novel, then conduct various experiments to verify our methods.

33.Unsupervised BatchNorm Adaptation (UBNA): A Domain Adaptation Method for Semantic Segmentation Without Using Source Domain Representations ⬇️

In this paper we present a solution to the task of "unsupervised domain adaptation (UDA) of a pre-trained semantic segmentation model without relying on any source domain representations". Previous UDA approaches for semantic segmentation either employed simultaneous training of the model in the source and target domains, or they relied on a generator network, replaying source domain data to the model during adaptation. In contrast, we present our novel Unsupervised BatchNorm Adaptation (UBNA) method, which adapts a pre-trained model to an unseen target domain without using---beyond the existing model parameters from pre-training---any source domain representations (neither data, nor generators) and which can also be applied in an online setting or using just a few unlabeled images from the target domain in a few-shot manner. Specifically, we partially adapt the normalization layer statistics to the target domain using an exponentially decaying momentum factor, thereby mixing the statistics from both domains. By evaluation on standard UDA benchmarks for semantic segmentation we show that this is superior to a model without adaptation and to baseline approaches using statistics from the target domain only. Compared to standard UDA approaches we report a trade-off between performance and usage of source domain representations.
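
A hedged PyTorch sketch of the adaptation loop as described: only the BatchNorm running statistics are updated on unlabeled target images, with an exponentially decaying momentum mixing source and target statistics. The schedule constants and the loader's batch structure are placeholders, not the paper's exact settings.

```python
# Hedged sketch of UBNA-style BatchNorm statistics adaptation.
import torch
import torch.nn as nn

@torch.no_grad()
def ubna_adapt(model, target_loader, steps=50, m0=0.1, decay=0.95):
    model.train()                                   # train mode so BN updates running stats
    for p in model.parameters():
        p.requires_grad_(False)                     # all weights stay frozen
    bn_layers = [m for m in model.modules() if isinstance(m, nn.BatchNorm2d)]
    for step, (images, *_) in enumerate(target_loader):   # loader yields (images, ...) batches
        if step >= steps:
            break
        momentum = m0 * decay ** step               # exponentially decaying momentum factor
        for bn in bn_layers:
            bn.momentum = momentum
        model(images)                               # forward pass updates BN stats only
    model.eval()
    return model
```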

34.Exploring Intermediate Representation for Monocular Vehicle Pose Estimation ⬇️

We present a new learning-based approach to recover egocentric 3D vehicle pose from a single RGB image. In contrast to previous works that directly map from local appearance to 3D angles, we explore a progressive approach by extracting meaningful Intermediate Geometrical Representations (IGRs) for 3D pose estimation. We design a deep model that transforms perceived intensities to IGRs, which are mapped to a 3D representation encoding object orientation in the camera coordinate system. To fulfill our goal, we need to specify what IGRs to use and how to learn them more effectively. We answer the former question by designing an interpolated cuboid representation that derives from primitive 3D annotation readily. The latter question motivates us to incorporate geometry knowledge by designing a new loss function based on a projective invariant. This loss function allows unlabeled data to be used in the training stage which is validated to improve representation learning. Our system outperforms previous monocular RGB-based methods for joint vehicle detection and pose estimation on the KITTI benchmark, achieving performance even comparable to stereo methods. Code and pre-trained models will be available at the project website.

35.SRF-GAN: Super-Resolved Feature GAN for Multi-Scale Representation ⬇️

Recent convolutional object detectors exploit multi-scale feature representations added with a top-down pathway in order to detect objects at different scales and learn stronger semantic feature responses. In general, during the top-down feature propagation, the coarser feature maps are upsampled to be combined with the features forwarded from the bottom-up pathway, and the combined stronger semantic features are the inputs of the detector's heads. However, simple interpolation methods (e.g. nearest neighbor and bilinear) are still used for increasing feature resolutions, although they cause noisy and blurred features. In this paper, we propose a novel generator for super-resolving features of convolutional object detectors. To achieve this, we first design a super-resolved feature GAN (SRF-GAN) consisting of a detection-based generator and a feature patch discriminator. In addition, we present SRF-GAN losses for generating high-quality super-resolved features and improving detection accuracy together. Our SRF generator can substitute for the traditional interpolation methods and can easily be fine-tuned in combination with other conventional detectors. To prove this, we have implemented our SRF-GAN using several recent one-stage and two-stage detectors, and improved detection accuracy over those detectors. Code is available at this https URL.

36.EvoPose2D: Pushing the Boundaries of 2D Human Pose Estimation using Neuroevolution ⬇️

Neural architecture search has proven to be highly effective in the design of computationally efficient, task-specific convolutional neural networks across several areas of computer vision. In 2D human pose estimation, however, its application has been limited by high computational demands. Hypothesizing that neural architecture search holds great potential for 2D human pose estimation, we propose a new weight transfer scheme that relaxes function-preserving mutations, enabling us to accelerate neuroevolution in a flexible manner. Our method produces 2D human pose network designs that are more efficient and more accurate than state-of-the-art hand-designed networks. In fact, the generated networks can process images at higher resolutions using less computation than previous networks at lower resolutions, permitting us to push the boundaries of 2D human pose estimation. Our baseline network designed using neuroevolution, which we refer to as EvoPose2D-S, provides comparable accuracy to SimpleBaseline while using 4.9x fewer floating-point operations and 13.5x fewer parameters. Our largest network, EvoPose2D-L, achieves new state-of-the-art accuracy on the Microsoft COCO Keypoints benchmark while using 2.0x fewer operations and 4.3x fewer parameters than its nearest competitor.

37.Shared Cross-Modal Trajectory Prediction for Autonomous Driving ⬇️

Predicting future trajectories of traffic agents in highly interactive environments is an essential and challenging problem for the safe operation of autonomous driving systems. Based on the fact that self-driving vehicles are equipped with various types of sensors (e.g., LiDAR scanner, RGB camera, radar, etc.), we propose a Cross-Modal Embedding framework that aims to benefit from the use of multiple input modalities. At training time, our model learns to embed a set of complementary features in a shared latent space by jointly optimizing the objective functions across different types of input data. At test time, a single input modality (e.g., LiDAR data) is required to generate predictions from the input perspective (i.e., in the LiDAR space), while taking advantage of the model trained with multiple sensor modalities. An extensive evaluation is conducted to show the efficacy of the proposed framework using two benchmark driving datasets.

38.Transducer Adaptive Ultrasound Volume Reconstruction ⬇️

Reconstructed 3D ultrasound volume provides more context information compared to a sequence of 2D scanning frames, which is desirable for various clinical applications such as ultrasound-guided prostate biopsy. Nevertheless, 3D volume reconstruction from freehand 2D scans is a very challenging problem, especially without the use of external tracking devices. Recent deep learning based methods demonstrate the potential of directly estimating inter-frame motion between consecutive ultrasound frames. However, such algorithms are specific to particular transducers and scanning trajectories associated with the training data, which may not be generalized to other image acquisition settings. In this paper, we tackle the data acquisition difference as a domain shift problem and propose a novel domain adaptation strategy to adapt deep learning algorithms to data acquired with different transducers. Specifically, feature extractors that generate transducer-invariant features from different datasets are trained by minimizing the discrepancy between deep features of paired samples in a latent space. Our results show that the proposed domain adaptation method can successfully align different feature distributions while preserving the transducer-specific information for universal freehand ultrasound volume reconstruction.

39.Quantifying Sources of Uncertainty in Deep Learning-Based Image Reconstruction ⬇️

Image reconstruction methods based on deep neural networks have shown outstanding performance, equalling or exceeding the state-of-the-art results of conventional approaches, but often do not provide uncertainty information about the reconstruction. In this work we propose a scalable and efficient framework to simultaneously quantify aleatoric and epistemic uncertainties in learned iterative image reconstruction. We build on a Bayesian deep gradient descent method for quantifying epistemic uncertainty, and incorporate the heteroscedastic variance of the noise to account for the aleatoric uncertainty. We show that our method exhibits competitive performance against conventional benchmarks for computed tomography with both sparse view and limited angle data. The estimated uncertainty captures the variability in the reconstructions, caused by the restricted measurement model, and by missing information, due to the limited angle geometry.
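
The generic building block for the aleatoric part is the heteroscedastic Gaussian negative log-likelihood, where the network predicts a per-pixel mean and log-variance. A minimal sketch follows; the paper embeds this in a Bayesian deep gradient descent scheme, which is not reproduced here.

```python
# Hedged sketch of the heteroscedastic Gaussian NLL for aleatoric uncertainty.
import torch

def heteroscedastic_nll(pred_mean, pred_log_var, target):
    """pred_mean, pred_log_var, target: tensors of the same shape.
    Predicting log-variance keeps the variance positive and the loss stable."""
    inv_var = torch.exp(-pred_log_var)
    return (0.5 * inv_var * (target - pred_mean) ** 2
            + 0.5 * pred_log_var).mean()
```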

40.Semi-Supervised Few-Shot Atomic Action Recognition ⬇️

Although excellent progress has been made, performance on action recognition still relies heavily on specific datasets, which are difficult to extend to new action classes due to labor-intensive labeling. Moreover, the high diversity in spatio-temporal appearance requires robust and representative action feature aggregation and attention. To address the above issues, we focus on atomic actions and propose a novel model for semi-supervised few-shot atomic action recognition. Our model features unsupervised and contrastive video embedding, loose action alignment, multi-head feature comparison, and attention-based aggregation, which together enable action recognition with only a few training examples by extracting more representative features and allowing flexibility in spatial and temporal alignment and variations in the action. Experiments show that our model can attain high accuracy on representative atomic action datasets, outperforming their respective state-of-the-art classification accuracies in the fully supervised setting.

41.Domain Adaptation based Technique for Image Emotion Recognition using Pre-trained Facial Expression Recognition Models ⬇️

In this paper, a domain adaptation based technique for recognizing the emotions in images containing facial, non-facial, and non-human components has been proposed. We have also proposed a novel technique to explain the proposed system's predictions in terms of an Intersection Score. Image emotion recognition is useful for graphics, gaming, animation, entertainment, and cinematography. However, well-labeled large-scale datasets and pre-trained models are not available for image emotion recognition. To overcome this challenge, we have proposed a deep learning approach based on an attentional convolutional network that adapts pre-trained facial expression recognition models. It detects the visual features of an image and performs emotion classification based on them. The experiments have been performed on the Flickr image dataset, and the images have been classified into 'angry,' 'happy,' 'sad,' and 'neutral' emotion classes. The proposed system has demonstrated better performance than the benchmark results, with an accuracy of 63.87% for image emotion recognition. We have also analyzed the embedding plots for various emotion classes to explain the proposed system's predictions.

42.Learning Efficient GANs via Differentiable Masks and co-Attention Distillation ⬇️

Generative Adversarial Networks (GANs) have been widely used in image translation, but their high computational and storage costs impede deployment on mobile devices. Prevalent methods for CNN compression cannot be directly applied to GANs due to the complicated generator architecture and the unstable adversarial training. To solve these problems, we introduce a novel GAN compression method, termed DMAD, by proposing a Differentiable Mask and a co-Attention Distillation. The former searches for a light-weight generator architecture in a training-adaptive manner. To overcome channel inconsistency when pruning the residual connections, an adaptive cross-block group sparsity is further incorporated. The latter simultaneously distills informative attention maps from both the generator and discriminator of a pre-trained model to the searched generator, effectively stabilizing the adversarial training of our light-weight model. Experiments show that DMAD can reduce the Multiply Accumulate Operations (MACs) of CycleGAN by 13x and those of Pix2Pix by 4x while retaining performance comparable to the full model. Code is available at this https URL.
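
To make the differentiable-mask idea concrete, here is a generic sketch of a soft channel mask with a sparsity penalty; the paper's mask parameterization and its cross-block group sparsity are more elaborate than this.

```python
# Hedged sketch of a differentiable channel mask for pruning (generic, not DMAD's exact form).
import torch
import torch.nn as nn

class MaskedConv(nn.Module):
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, k, padding=k // 2)
        self.gate = nn.Parameter(torch.zeros(out_ch))   # learnable per-channel mask logits

    def forward(self, x):
        mask = torch.sigmoid(self.gate).view(1, -1, 1, 1)
        return self.conv(x) * mask                      # soft channel gating during search

def sparsity_penalty(model, weight=1e-4):
    """Push gates toward zero so low-mask channels can be pruned after training."""
    return weight * sum(torch.sigmoid(m.gate).sum()
                        for m in model.modules() if isinstance(m, MaskedConv))
```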

43.Extreme Value Preserving Networks ⬇️

Recent evidence shows that convolutional neural networks (CNNs) are biased towards textures, so that CNNs are not robust to adversarial perturbations over textures, while traditional robust visual features like SIFT (scale-invariant feature transforms) are designed to be robust across a substantial range of affine distortion, addition of noise, etc., mimicking the nature of human perception. This paper aims to leverage the good properties of SIFT to renovate CNN architectures towards better accuracy and robustness. We borrow the scale-space extreme value idea from SIFT and propose extreme value preserving networks (EVPNets). Experiments demonstrate that EVPNets can achieve similar or better accuracy than conventional CNNs, while achieving much better robustness against a set of adversarial attacks (FGSM, PGD, etc.), even without adversarial training.
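
The borrowed SIFT ingredient can be illustrated with the classical recipe: build a Difference-of-Gaussians stack and keep points that are extrema over space and scale. This is the textbook operation, not the EVPNet layer itself; the sigma ladder and contrast threshold are illustrative.

```python
# Hedged sketch of SIFT-style scale-space extreme value detection.
import numpy as np
from scipy.ndimage import gaussian_filter, maximum_filter, minimum_filter

def dog_extrema(image, sigmas=(1.0, 1.6, 2.6, 4.2)):
    """image: (H, W) grayscale, intensities assumed in [0, 1]."""
    blurred = [gaussian_filter(image.astype(float), s) for s in sigmas]
    dog = np.stack([b2 - b1 for b1, b2 in zip(blurred, blurred[1:])])   # (S-1, H, W)
    # A point survives if it is a max or min within its 3x3x3 scale-space neighborhood.
    is_max = dog == maximum_filter(dog, size=3)
    is_min = dog == minimum_filter(dog, size=3)
    keep = (is_max | is_min) & (np.abs(dog) > 0.01)                     # contrast threshold
    return np.argwhere(keep)                                            # (scale, y, x) triples
```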

44.Vis-CRF, A Classical Receptive Field Model for VISION ⬇️

Over the last decade, a variety of new neurophysiological experiments have led to new insights as to how, when and where retinal processing takes place, and the nature of the retinal representation encoding sent to the cortex for further processing. Based on these neurobiological discoveries, in our previous work, we provided computer simulation evidence to suggest that Geometrical illusions are explained in part, by the interaction of multiscale visual processing performed in the retina. The output of our retinal stage model, named Vis-CRF, is presented here for a sample of natural image and for several types of Tilt Illusion, in which the final tilt percept arises from multiple scale processing of Difference of Gaussians (DoG) and the perceptual interaction of foreground and background elements (Nematzadeh and Powers, 2019; Nematzadeh, 2018; Nematzadeh, Powers and Lewis, 2017; Nematzadeh, Lewis and Powers, 2015).

45.CG-Net: Conditional GIS-aware Network for Individual Building Segmentation in VHR SAR Images ⬇️

Object retrieval and reconstruction from very high resolution (VHR) synthetic aperture radar (SAR) images are of great importance for urban SAR applications, yet highly challenging owing to the complexity of SAR data. This paper addresses the issue of individual building segmentation from a single VHR SAR image in large-scale urban areas. To achieve this, we introduce building footprints from GIS data as complementary information and propose a novel conditional GIS-aware network (CG-Net). The proposed model learns multi-level visual features and employs building footprints to normalize the features for predicting building masks in the SAR image. We validate our method using a high resolution spotlight TerraSAR-X image collected over Berlin. Experimental results show that the proposed CG-Net effectively brings improvements with variant backbones. We further compare two representations of building footprints, namely complete building footprints and sensor-visible footprint segments, for our task, and conclude that the use of the former leads to better segmentation results. Moreover, we investigate the impact of inaccurate GIS data on our CG-Net, and this study shows that CG-Net is robust against positioning errors in GIS data. In addition, we propose an approach of ground truth generation of buildings from an accurate digital elevation model (DEM), which can be used to generate large-scale SAR image datasets. The segmentation results can be applied to reconstruct 3D building models at level-of-detail (LoD) 1, which is demonstrated in our experiments.

46.2D+3D Facial Expression Recognition via Discriminative Dynamic Range Enhancement and Multi-Scale Learning ⬇️

In 2D+3D facial expression recognition (FER), existing methods generate multi-view geometry maps to enhance the depth feature representation. However, this may introduce false estimations due to local plane fitting from incomplete point clouds. In this paper, we propose a novel Map Generation technique from the viewpoint of information theory, to boost the slight 3D expression differences from strong personality variations. First, we examine the HDR depth data to extract the discriminative dynamic range $r_{dis}$, and maximize the entropy of $r_{dis}$ to a global optimum. Then, to prevent the large deformation caused by over-enhancement, we introduce a depth distortion constraint and reduce the complexity from $O(KN^2)$ to $O(KN\tau)$. Furthermore, the constrained optimization is modeled as a $K$-edges maximum weight path problem in a directed acyclic graph, and we solve it efficiently via dynamic programming. Finally, we also design an efficient Facial Attention structure to automatically locate subtle discriminative facial parts for multi-scale learning, and train it with a proposed loss function $\mathcal{L}_{FA}$ without any facial landmarks. Experimental results on different datasets show that the proposed method is effective and outperforms the state-of-the-art 2D+3D FER methods in both FER accuracy and the output entropy of the generated maps.
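
For intuition, a $K$-edges maximum-weight path on a topologically ordered DAG can be solved by the dynamic program sketched below; the paper's specific graph construction and entropy-based edge weights are not reproduced here.

```python
# Hedged sketch of a K-edges maximum-weight path DP on a DAG (generic form).
import numpy as np

def k_edge_max_path(weights, K):
    """weights: (V, V) edge weights of a DAG whose node indices are a topological
    order (use -inf for absent edges). Returns the best K-edge path weight."""
    V = weights.shape[0]
    dp = np.full((K + 1, V), -np.inf)
    dp[0] = 0.0                                       # a 0-edge path may start at any node
    for k in range(1, K + 1):
        for v in range(V):
            cand = dp[k - 1, :v] + weights[:v, v]     # edges only go forward in the order
            if cand.size:
                dp[k, v] = cand.max()
    return dp[K].max()                                # best K-edge path ending anywhere
```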

47.EffiScene: Efficient Per-Pixel Rigidity Inference for Unsupervised Joint Learning of Optical Flow, Depth, Camera Pose and Motion Segmentation ⬇️

This paper addresses the challenging unsupervised scene flow estimation problem by jointly learning four low-level vision sub-tasks: optical flow $\textbf{F}$, stereo-depth $\textbf{D}$, camera pose $\textbf{P}$ and motion segmentation $\textbf{S}$. Our key insight is that the rigidity of the scene shares the same inherent geometrical structure with object movements and scene depth. Hence, rigidity from $\textbf{S}$ can be inferred by jointly coupling $\textbf{F}$, $\textbf{D}$ and $\textbf{P}$ to achieve more robust estimation. To this end, we propose a novel scene flow framework named EffiScene with efficient joint rigidity learning, going beyond existing pipelines with independent auxiliary structures. In EffiScene, we first estimate optical flow and depth at the coarse level and then compute camera pose by the Perspective-$n$-Point method. To jointly learn local rigidity, we design a novel Rigidity From Motion (RfM) layer with three principal components: (i) correlation extraction; (ii) boundary learning; and (iii) outlier exclusion. Final outputs are fused based on the rigid map $M_R$ from RfM at the finer level. To efficiently train EffiScene, two new losses $\mathcal{L}_{bnd}$ and $\mathcal{L}_{unc}$ are designed to prevent trivial solutions and to regularize the flow boundary discontinuity. Extensive experiments on the scene flow benchmark KITTI show that our method is effective and significantly improves the state-of-the-art approaches for all sub-tasks, i.e. optical flow (5.19 $\rightarrow$ 4.20), depth estimation (3.78 $\rightarrow$ 3.46), visual odometry (0.012 $\rightarrow$ 0.011) and motion segmentation (0.57 $\rightarrow$ 0.62).

48.A New Similarity Space Tailored for Supervised Deep Metric Learning ⬇️

We propose a novel deep metric learning method. Unlike many works in this area, we define a novel latent space obtained through an autoencoder. The new space, namely the S-space, is divided into different regions describing where pairs of objects are similar or dissimilar, and we locate markers to identify these regions. We estimate the similarity between objects through a kernel based on the Student's t-distribution, which measures the distance between the markers and the new data representation. In our approach, we simultaneously estimate the markers' positions in the S-space and represent the objects in the same space. Moreover, we propose a new regularization function that prevents similar markers from collapsing together. We present evidence that our proposal can represent complex spaces, for instance, when groups of similar objects are located in disjoint regions. We compare our proposal to 9 different distance metric learning approaches (four of them based on deep learning) on 28 real-world heterogeneous datasets. According to the four quantitative metrics used, our method outperforms all nine strategies from the literature.
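
The Student's t kernel here is the same heavy-tailed similarity popularized by t-SNE. A minimal sketch of embedding-to-marker similarities under that kernel (marker placement and training are omitted; all values are synthetic):

```python
import numpy as np

def t_student_similarity(z, markers, dof=1.0):
    """Similarity of each embedding z_i to each marker m_j under a
    Student's t kernel: q_ij ∝ (1 + ||z_i - m_j||^2 / dof)^(-(dof+1)/2)."""
    d2 = ((z[:, None, :] - markers[None, :, :]) ** 2).sum(-1)
    q = (1.0 + d2 / dof) ** (-(dof + 1.0) / 2.0)
    return q / q.sum(axis=1, keepdims=True)  # normalize over markers

z = np.random.randn(5, 8)        # 5 objects embedded in the S-space
markers = np.random.randn(3, 8)  # 3 learned region markers
print(t_student_similarity(z, markers).shape)  # (5, 3)
```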

49.Feature Sharing and Integration for Cooperative Cognition and Perception with Volumetric Sensors ⬇️

The recent advancement in computational and communication systems has led to the introduction of high-performing neural networks and high-speed wireless vehicular communication networks. As a result, new technologies such as cooperative perception and cognition have emerged, addressing the inherent limitations of sensory devices by providing solutions for the detection of partially occluded targets and expanding the sensing range. However, designing a reliable cooperative cognition or perception system requires addressing the challenges caused by limited network resources and discrepancies between the data shared by different sources. In this paper, we examine the requirements, limitations, and performance of different cooperative perception techniques, and present an in-depth analysis of the notion of Deep Feature Sharing (DFS). We explore different cooperative object detection designs and evaluate their performance in terms of average precision, using the Volony dataset for our experimental study. The results confirm that DFS methods are significantly less sensitive to the localization error caused by GPS noise. Furthermore, the detection gain that DFS methods obtain from adding more cooperative participants to the scene is comparable to that of the raw-information-sharing technique, while DFS offers flexibility in designing toward communication requirements.

50.Overcomplete Deep Subspace Clustering Networks ⬇️

Deep Subspace Clustering Networks (DSC) provide an efficient solution to the problem of unsupervised subspace clustering by using an undercomplete deep auto-encoder with a fully-connected layer to exploit the self-expressiveness property. Because the method relies on undercomplete representations of the input data, it is less robust and more dependent on pre-training. To overcome this, we propose a simple yet efficient alternative - Overcomplete Deep Subspace Clustering Networks (ODSC) - which uses overcomplete representations for subspace clustering. In our proposed method, we fuse the features from both undercomplete and overcomplete auto-encoder networks before passing them through the self-expressive layer, thus enabling us to extract a more meaningful and robust representation of the input data for clustering. Experimental results on four benchmark datasets show the effectiveness of the proposed method over DSC and other clustering methods in terms of clustering error. Our method is also less dependent than DSC on where pre-training is stopped to achieve the best performance, and is more robust to noise. Code: this https URL
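
The self-expressiveness property says each point is a linear combination of other points in its subspace, Z ≈ CZ with a zero diagonal. A minimal PyTorch sketch of such a layer (independent of ODSC's encoder details; sizes and the regularization weight are illustrative):

```python
import torch
import torch.nn as nn

class SelfExpressiveLayer(nn.Module):
    """Learns coefficients C such that latent codes satisfy Z ≈ C Z,
    with diag(C) forced to zero so points cannot explain themselves."""
    def __init__(self, n_samples):
        super().__init__()
        self.C = nn.Parameter(1e-4 * torch.randn(n_samples, n_samples))

    def forward(self, Z):
        C = self.C - torch.diag(torch.diag(self.C))  # zero the diagonal
        return C @ Z, C

Z = torch.randn(100, 32)                 # fused latent codes
Z_hat, C = SelfExpressiveLayer(100)(Z)
loss = ((Z - Z_hat) ** 2).sum() + 1.0 * C.abs().sum()  # reconstruction + sparsity
print(loss.item())
```

After training, spectral clustering on the affinity |C| + |C|ᵀ typically yields the subspace assignments.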

51.Where Are You? Localization from Embodied Dialog ⬇️

We present Where Are You? (WAY), a dataset of ~6k dialogs in which two humans -- an Observer and a Locator -- complete a cooperative localization task. The Observer is spawned at random in a 3D environment and can navigate from first-person views while answering questions from the Locator. The Locator must localize the Observer in a detailed top-down map by asking questions and giving instructions. Based on this dataset, we define three challenging tasks: Localization from Embodied Dialog or LED (localizing the Observer from dialog history), Embodied Visual Dialog (modeling the Observer), and Cooperative Localization (modeling both agents). In this paper, we focus on the LED task -- providing a strong baseline model with detailed ablations characterizing both dataset biases and the importance of various modeling choices. Our best model achieves 32.7% success at identifying the Observer's location within 3m in unseen buildings, vs. 70.4% for human Locators.

52.Assistive Diagnostic Tool for Brain Tumor Detection using Computer Vision ⬇️

Today, over 700,000 people are living with brain tumors in the United States. Brain tumors can spread very quickly to other parts of the brain and the spinal cord unless preventive action is taken, and the survival rate for this disease is less than 40% for both men and women. A conclusive and early diagnosis of a brain tumor could therefore be the difference between life and death. However, brain tumor detection and segmentation are tedious and time-consuming processes that can only be performed by radiologists and clinical experts. The use of computer vision techniques, such as Mask R-CNN (Mask Region-based Convolutional Neural Network), to detect and segment brain tumors can mitigate the possibility of human error while increasing prediction accuracy. The goal of this project is to create an assistive diagnostic tool for brain tumor detection and segmentation. Transfer learning was used as a starting point with Mask R-CNN, and the necessary parameters were altered accordingly. The model was trained for 20 epochs and then tested; the predicted segmentations achieved 90% agreement with the ground truth, suggesting that the model performed at a high level. Once the model was finalized, a web application running on Flask was created to serve as a tool for medical professionals: it allows doctors to upload patients' brain MRI images and receive immediate diagnosis and segmentation results for each patient.
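
As an illustration of the described workflow, a minimal Flask endpoint for uploading an MRI image and returning a prediction might look like this; `run_mask_rcnn` is a hypothetical stand-in for the trained model, not the project's actual code:

```python
from flask import Flask, request, jsonify

app = Flask(__name__)

def run_mask_rcnn(image_bytes):
    """Hypothetical placeholder for the trained Mask R-CNN inference."""
    return {"tumor_detected": True, "mask_area_px": 1523}

@app.route("/predict", methods=["POST"])
def predict():
    f = request.files.get("mri")  # uploaded MRI image
    if f is None:
        return jsonify(error="no file uploaded"), 400
    return jsonify(run_mask_rcnn(f.read()))

if __name__ == "__main__":
    app.run(debug=True)
```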

53.Modality-Buffet for Real-Time Object Detection ⬇️

Real-time object detection in videos on lightweight hardware is a crucial component of many robotic tasks. Detectors using different modalities and with varying computational complexity offer different trade-offs. One option is a very lightweight model that predicts from all modalities at once for each frame. However, in some situations (e.g., in static scenes) it may be better to use a more complex but more accurate model and to extrapolate from previous predictions for frames arriving during processing. We formulate this task as a sequential decision-making problem and use reinforcement learning (RL) to learn a policy that decides, from the RGB input, which detector from a portfolio of object detectors to use for the next prediction. The objective of the RL agent is to maximize the per-image accuracy of the predictions. We evaluate the approach on the Waymo Open Dataset and show that it exceeds the performance of each individual detector.

54.Denoising Score-Matching for Uncertainty Quantification in Inverse Problems ⬇️

Deep neural networks have proven extremely efficient at solving a wide range of inverse problems, but most often the uncertainty on the solution they provide is hard to quantify. In this work, we propose a generic Bayesian framework for solving inverse problems, in which we limit the use of deep neural networks to learning a prior distribution on the signals to recover. We adopt recent denoising score matching techniques to learn this prior from data, and subsequently use it as part of an annealed Hamiltonian Monte-Carlo scheme to sample the full posterior of image inverse problems. We apply this framework to Magnetic Resonance Image (MRI) reconstruction and illustrate how this approach not only yields high-quality reconstructions but can also be used to assess the uncertainty on particular features of a reconstructed image.
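
The prior-learning step uses denoising score matching: a network $s_\theta$ is trained so that $s_\theta(x + \sigma\epsilon) \approx -\epsilon/\sigma$. A minimal sketch of that objective (the HMC sampler and the MRI forward operator are omitted; the toy network operates on flat signals):

```python
import torch

def dsm_loss(score_net, x, sigma):
    """Denoising score matching: perturb x with Gaussian noise and
    regress the score of the perturbed distribution, -eps / sigma."""
    eps = torch.randn_like(x)
    x_noisy = x + sigma * eps
    target = -eps / sigma
    return ((score_net(x_noisy) - target) ** 2).mean()

score_net = torch.nn.Sequential(  # toy score network on 64-dim signals
    torch.nn.Linear(64, 128), torch.nn.SiLU(), torch.nn.Linear(128, 64))
x = torch.randn(16, 64)
print(dsm_loss(score_net, x, sigma=0.1).item())
```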

55.Topology-Based Feature Design and Tracking for Multi-Center Cyclones ⬇️

In this paper, we propose a concept to design, track, and compare application-specific feature definitions expressed as sets of critical points. Our work has been inspired by the observation that in many applications a large variety of different feature definitions for the same concept are used. Often, these definitions compete with each other and it is unclear which definition should be used in which context. A prominent example is the definition of cyclones in climate research. Despite the differences, frequently these feature definitions can be related to topological concepts.
In our approach, we provide a cyclone tracking framework that supports interactive feature definition and comparison based on a precomputed tracking graph that stores all extremal points as well as their temporal correspondents. The framework combines a set of independent building blocks: critical point extraction, critical point tracking, feature definition, and track exploration. One of the major advantages of such an approach is the flexibility it provides: each block is exchangeable. Moreover, it enables us to perform the most expensive analysis, the construction of the full tracking graph, as a preprocessing step, while keeping the feature definition interactive. Different feature definitions can be explored and compared interactively based on this tracking graph. Features are specified by rules for grouping critical points, while feature tracking corresponds to filtering and querying the full tracking graph with specific requests. We demonstrate this method for cyclone identification and tracking in the context of climate research.

56.On Numerosity of Deep Neural Networks ⬇️

Recently, a provocative claim was published that number sense spontaneously emerges in a deep neural network trained merely for visual object recognition. If true, this has far-reaching significance for the fields of machine learning and cognitive science alike. In this paper, we prove the above claim to be unfortunately incorrect. The statistical analysis supporting the claim is flawed in that the sample set used to identify number-aware neurons is too small compared to the huge number of neurons in the object recognition network. By this flawed analysis, one could mistakenly identify number-sensing neurons in any randomly initialized deep neural network that has not been trained at all. With the above critique, we ask: what if a deep convolutional neural network is carefully trained for numerosity? Our findings are mixed. Even after being trained with number-depicting images, the deep learning approach still has difficulty acquiring the abstract concept of numbers, a cognitive task that preschoolers perform with ease. On the other hand, we do find some encouraging evidence suggesting that deep neural networks are more robust to distribution shift for small numbers than for large numbers.

57.Flame Stability Analysis of Flame Spray Pyrolysis by Artificial Intelligence ⬇️

Flame spray pyrolysis (FSP) is a process used to synthesize nanoparticles through the combustion of an atomized precursor solution; it has applications in catalysts, battery materials, and pigments. Current limitations revolve around understanding how to consistently achieve a stable flame and reliably produce nanoparticles. Machine learning and artificial intelligence algorithms that detect unstable flame conditions in real time may be a means of streamlining the synthesis process and improving FSP efficiency. In this study, FSP flame stability is first quantified by analyzing the brightness of the flame's anchor point. This analysis is then used to label data for both unsupervised and supervised machine learning approaches. The unsupervised learning approach allows for autonomous labeling and classification of new data by representing the data in a reduced-dimensional space and identifying combinations of features that most effectively cluster it. The supervised learning approach, on the other hand, requires human labeling of training and test data, but is able to classify multiple objects of interest (such as the burner and pilot flames) within the video feed. The accuracy of each technique is compared against the evaluations of human experts. Both the unsupervised and supervised approaches can track and classify FSP flame conditions in real time to alert users of unstable flame conditions. This research has the potential to enable autonomous tracking and management of flame spray pyrolysis, as well as other flame technologies, by monitoring and classifying flame stability.
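
A rough sketch of the first step, quantifying stability from the brightness of the flame's anchor-point region in each video frame; the file name, ROI coordinates, and threshold below are made up for illustration:

```python
import cv2

def anchor_brightness(frame, roi=(200, 300, 100, 150)):
    """Mean gray-level brightness inside the anchor-point ROI
    (y0, y1, x0, x1). A drop in brightness suggests flame instability."""
    y0, y1, x0, x1 = roi
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    return float(gray[y0:y1, x0:x1].mean())

cap = cv2.VideoCapture("fsp_flame.mp4")  # hypothetical recording
ok, frame = cap.read()
if ok:
    b = anchor_brightness(frame)
    print("unstable" if b < 40.0 else "stable", b)
cap.release()
```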

58.Lung Segmentation in Chest X-rays with Res-CR-Net ⬇️

Deep Neural Networks (DNN) are widely used to carry out segmentation tasks in biomedical images. Most DNNs developed for this purpose are based on some variation of the encoder-decoder U-Net architecture. Here we show that Res-CR-Net, a new type of fully convolutional neural network, which was originally developed for the semantic segmentation of microscopy images, and which does not adopt a U-Net architecture, is very effective at segmenting the lung fields in chest X-rays from either healthy patients or patients with a variety of lung pathologies.

59.Empowering Things with Intelligence: A Survey of the Progress, Challenges, and Opportunities in Artificial Intelligence of Things ⬇️

In the Internet of Things (IoT) era, billions of sensors and devices collect and process data from the environment, transmit them to cloud centers, and receive feedback via the internet for connectivity and perception. However, transmitting massive amounts of heterogeneous data, perceiving complex environments from these data, and then making smart decisions in a timely manner are difficult. Artificial intelligence (AI), especially deep learning, is now a proven success in various areas including computer vision, speech recognition, and natural language processing. AI introduced into the IoT heralds the era of artificial intelligence of things (AIoT). This paper presents a comprehensive survey on AIoT to show how AI can empower the IoT to make it faster, smarter, greener, and safer. Specifically, we briefly present the AIoT architecture in the context of cloud computing, fog computing, and edge computing. Then, we present progress in AI research for IoT from four perspectives: perceiving, learning, reasoning, and behaving. Next, we summarize some promising applications of AIoT that are likely to profoundly reshape our world. Finally, we highlight the challenges facing AIoT and some potential research opportunities.

60.Normalized Weighting Schemes for Image Interpolation Algorithms ⬇️

This paper presents and evaluates four weighting schemes for image interpolation algorithms. The first scheme is based on the normalized area of the circle whose diameter equals the minimum side of a tetragon. The second scheme is based on the normalized area of the circle whose radius equals the hypotenuse. The third scheme is based on the normalized area of the triangle whose base and height equal the hypotenuse and the virtual pixel length, respectively. The fourth scheme is based on the normalized area of the circle whose radius equals the virtual-pixel-length-based hypotenuse. Experiments showed mixed algorithm performance, indicating the need for further research.
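
The abstract only names the geometric quantities, but the general recipe is the same in all four schemes: compute an area per neighbor, then normalize so the weights sum to one. A speculative sketch of the first scheme under that reading (the input side lengths are assumed):

```python
import math

def circle_area_weights(min_sides):
    """Sketch of scheme 1: per-neighbor weight from the area of a circle
    whose diameter is the minimum side of the corresponding tetragon,
    normalized so the weights sum to 1."""
    areas = [math.pi * (d / 2.0) ** 2 for d in min_sides]
    total = sum(areas)
    return [a / total for a in areas]

# Four tetragon minimum-side lengths around a virtual pixel (made up).
print(circle_area_weights([0.3, 0.7, 0.5, 0.9]))
```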

61.Deep Learning Based HPV Status Prediction for Oropharyngeal Cancer Patients ⬇️

We investigated the ability of deep learning models to detect HPV status from imaging. To overcome the problem of small medical datasets, we used a transfer learning approach: a 3D convolutional network pre-trained on sports video clips was fine-tuned so that the full 3D information in the CT images could be exploited. The video-pretrained model was able to differentiate HPV-positive from HPV-negative cases with an area under the receiver operating characteristic curve (AUC) of 0.81 on an external test set. Compared to a 3D convolutional neural network (CNN) trained from scratch and a 2D architecture pre-trained on ImageNet, the video-pretrained model performed best.
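 
The transfer setup, adapting a video-pretrained 3D CNN to binary HPV classification, can be sketched with torchvision's Kinetics-pretrained r3d_18; the paper's exact backbone and preprocessing may differ:

```python
import torch
import torch.nn as nn
from torchvision.models.video import r3d_18

# 3D ResNet pretrained on video clips (Kinetics, torchvision >= 0.13);
# swap the classifier head for the two HPV classes and fine-tune on CT.
model = r3d_18(weights="KINETICS400_V1")
model.fc = nn.Linear(model.fc.in_features, 2)

# CT volumes reshaped to the (batch, channels=3, frames, H, W) layout the
# video model expects (replicating the CT channel is one common trick).
ct = torch.randn(2, 3, 16, 112, 112)
print(model(ct).shape)  # torch.Size([2, 2])
```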

62.Structural and Functional Decomposition for Personality Image Captioning in a Communication Game ⬇️

Personality image captioning (PIC) aims to describe an image with a natural language caption given a personality trait. In this work, we introduce a novel formulation for PIC based on a communication game between a speaker and a listener. The speaker attempts to generate natural language captions, while the listener encourages the generated captions to contain discriminative information about the input images and personality traits. In this way, we expect the generated captions to naturally represent the images and express the traits. In addition, we propose to adapt the language model GPT2 to perform caption generation for PIC, enabling the speaker and listener to benefit from the language encoding capacity of GPT2. Our experiments show that the proposed model achieves state-of-the-art performance for PIC.
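
One simple way to adapt GPT2 as the speaker is to condition generation on a personality-trait prefix, shown here with the Hugging Face transformers library; the paper's visual conditioning is omitted, and the prompt format is an assumption:

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Trait-conditioned prefix; a full PIC model would also inject visual
# features, which this sketch leaves out.
prompt = "Personality: romantic. Caption:"
ids = tokenizer(prompt, return_tensors="pt").input_ids
out = model.generate(ids, max_length=30, do_sample=True, top_k=50,
                     pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```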

63.Decision and Feature Level Fusion of Deep Features Extracted from Public COVID-19 Data-sets ⬇️

The Coronavirus disease (COVID-19), an infectious pulmonary disorder, has affected millions of people and has been declared a global pandemic by the WHO. Due to the highly contagious nature of COVID-19 and its high potential to cause severe conditions in patients, the development of rapid and accurate diagnostic tools has gained importance. Real-time reverse transcription polymerase chain reaction (RT-PCR), which detects the presence of Coronavirus RNA in mucus and saliva samples, suffers from low sensitivity, especially in the early stage of infection. The use of chest radiography in the early diagnosis of COVID-19 has therefore been increasing, owing to its fast imaging speed, significantly lower cost, and low radiation dose. In our study, we propose a computer-aided diagnosis system for X-ray images based on convolutional neural networks (CNNs), which can be used by radiologists as a supporting tool in COVID-19 detection. Deep feature sets extracted by CNNs were concatenated for feature-level fusion and fed to multiple classifiers following a decision-level fusion idea, with the aim of discriminating between the COVID-19, pneumonia, and no-finding classes. For decision-level fusion, a majority voting scheme was applied to the classifiers' resulting decisions. Accuracy values and confusion-matrix-based evaluation criteria are presented for three progressively created datasets. The aspects in which the proposed method is superior to existing COVID-19 detection studies are discussed, and the fusion performance of the proposed approach is validated visually using the Class Activation Mapping technique. The experimental results show that the proposed approach attains high COVID-19 detection performance, with accuracy comparable to, and precision/recall superior to, those of existing studies.
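
The two fusion ideas reduce to concatenating CNN feature vectors (feature level) and majority voting over classifier decisions (decision level). A compact sketch with scikit-learn; the features here are synthetic stand-ins for the CNN outputs, and the classifier choices are illustrative:

```python
import numpy as np
from sklearn.ensemble import VotingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Stand-ins for deep features from two CNN backbones.
f1 = np.random.randn(300, 128)
f2 = np.random.randn(300, 256)
X = np.concatenate([f1, f2], axis=1)      # feature-level fusion
y = np.random.randint(0, 3, 300)          # COVID-19 / pneumonia / no-finding

# Decision-level fusion: hard majority vote over three classifiers.
clf = VotingClassifier([("svm", SVC()), ("rf", RandomForestClassifier()),
                        ("lr", LogisticRegression(max_iter=1000))],
                       voting="hard")
clf.fit(X, y)
print(clf.predict(X[:5]))
```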

64.Building Movie Map -- A Tool for Exploring Areas in a City -- and its Evaluation ⬇️

We propose a new Movie Map system with an interface for exploring cities. The system consists of four stages: acquisition, analysis, management, and interaction. In the acquisition stage, omnidirectional videos are taken along streets in target areas. Frames of the video are localized on the map, intersections are detected, and videos are segmented; turning views at intersections are subsequently generated. By connecting the video segments according to a specified movement through an area, the streets can be viewed smoothly. The interface allows for easy exploration of a target area and can show virtual billboards of stores in the view. We conducted user studies comparing our system to Google Street View (GSV) in a scenario where users could move and explore freely to find a landmark. The experiment showed that our system offered a better user experience than GSV.

65.Assessing Wireless Sensing Potential with Large Intelligent Surfaces ⬇️

Sensing capability is one of the most highlighted new features of future 6G wireless networks. This paper addresses the sensing potential of Large Intelligent Surfaces (LIS) in an exemplary Industry 4.0 scenario. Beyond the attention LIS has received in terms of communication aspects, it can also offer a high-resolution rendering of the propagation environment: in an indoor setting it can be placed in proximity to the sensed phenomena, while the high resolution is offered by densely spaced tiny antennas deployed over a large area. By treating the LIS as a radio image of the environment based on the received signal power, we develop techniques to sense the environment, leveraging tools from image processing and machine learning. Once a holographic image is obtained, a Denoising Autoencoder (DAE) network can be used to construct a super-resolution image, leading to sensing advantages not available in traditional sensing systems. We also derive a Generalized Likelihood Ratio Test (GLRT) as a benchmark for the machine learning solution. We evaluate these methods in a scenario where we need to detect whether an industrial robot deviates from a predefined route. The results show that LIS-based sensing offers high precision and has high application potential in indoor industrial environments.
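
The enhancement step relies on a denoising autoencoder over LIS "radio images". A bare-bones DAE of that kind (architecture, sizes, and noise level are illustrative, not the paper's):

```python
import torch
import torch.nn as nn

class DenoisingAutoencoder(nn.Module):
    """Toy DAE: reconstruct a clean radio image from a noisy one."""
    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(16, 16, 3, padding=1), nn.ReLU())
        self.dec = nn.Conv2d(16, 1, 3, padding=1)

    def forward(self, x):
        return self.dec(self.enc(x))

clean = torch.rand(8, 1, 64, 64)           # received-power maps (synthetic)
noisy = clean + 0.1 * torch.randn_like(clean)
dae = DenoisingAutoencoder()
loss = ((dae(noisy) - clean) ** 2).mean()  # train to undo the noise
print(loss.item())
```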

66.AdCo: Adversarial Contrast for Efficient Learning of Unsupervised Representations from Self-Trained Negative Adversaries ⬇️

Contrastive learning relies on constructing a collection of negative examples that are sufficiently hard to discriminate against positive queries when their representations are self-trained. Existing contrastive learning methods either maintain a queue of negative samples over minibatches, of which only a small portion is updated per iteration, or use only the other examples from the current minibatch as negatives. They thus either cannot closely track the change of the learned representation over iterations, since the queue is never updated as a whole, or they discard useful information from past minibatches. Alternatively, we propose to directly learn a set of negative adversaries playing against the self-trained representation. Two players, the representation network and the negative adversaries, are alternately updated to obtain the most challenging negative examples, against which the representation of positive queries is trained to discriminate. We further show that the negative adversaries are updated toward a weighted combination of positive queries by maximizing the adversarial contrastive loss, allowing them to closely track the change of representations over time. Experimental results demonstrate that the proposed Adversarial Contrastive (AdCo) model not only achieves superior performance over state-of-the-art contrastive models with little computational overhead, but can also be pretrained more rapidly with fewer epochs.
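
The core idea, negatives updated by gradient ascent on the same contrastive loss the encoder descends, can be sketched as follows. This is a toy setup: the real model uses momentum encoders and image augmentations, while here the encoder, views, and learning rates are placeholders:

```python
import torch
import torch.nn.functional as F

dim, n_neg, tau = 64, 256, 0.1
encoder = torch.nn.Linear(128, dim)                 # toy encoder
negatives = torch.nn.Parameter(torch.randn(n_neg, dim))
opt_enc = torch.optim.SGD(encoder.parameters(), lr=0.1)
opt_neg = torch.optim.SGD([negatives], lr=0.1)

x = torch.randn(32, 128)
q = F.normalize(encoder(x), dim=1)                  # queries
k = F.normalize(encoder(x + 0.01 * torch.randn_like(x)), dim=1)  # second view
pos = (q * k).sum(1, keepdim=True) / tau
neg = q @ F.normalize(negatives, dim=1).t() / tau
# InfoNCE with the positive logit in column 0.
loss = F.cross_entropy(torch.cat([pos, neg], 1),
                       torch.zeros(32, dtype=torch.long))

opt_enc.zero_grad(); opt_neg.zero_grad()
loss.backward()
opt_enc.step()                                      # encoder: minimize loss
negatives.grad.neg_()                               # negatives: maximize it
opt_neg.step()
print(loss.item())
```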

67.Sub-clusters of Normal Data for Anomaly Detection ⬇️

Anomaly detection in data analysis is an interesting but still challenging research topic in real-world applications. As data dimensionality increases, effective anomaly characterization requires understanding the semantic context of the data. However, existing anomaly detection methods show limited performance with high-dimensional data such as ImageNet, having been evaluated mainly on low-dimensional, clean, and well-separated datasets such as MNIST and CIFAR-10. In this paper, we study anomaly detection with high-dimensional and complex normal data. Our observation is that, in general, anomalies are defined by semantically explainable features, which can also be used to define semantic sub-clusters of the normal data. We hypothesize that if there exists a reasonably good feature space that semantically separates sub-clusters of the given normal data, unseen anomalies can also be well distinguished from the normal data in that space. We propose to perform semantic clustering on the given normal data and to train a classifier to learn the discriminative feature space in which anomaly detection is finally performed. Based on careful and extensive experimental evaluations on MNIST, CIFAR-10, and ImageNet with various combinations of normal and anomaly data, we show that our anomaly detection scheme outperforms state-of-the-art methods, especially on high-dimensional real-world images.
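
In outline, the scheme is: cluster the normal data semantically, train a classifier on the cluster labels, then score test points by their confidence in that space. A compact scikit-learn sketch under that reading (the paper uses deep features and its own scoring; the data here is synthetic):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

# Normal training data only (synthetic stand-in for deep features).
X_normal = np.random.randn(500, 32)

# 1) Semantic sub-clusters of the normal data.
labels = KMeans(n_clusters=5, n_init=10).fit_predict(X_normal)

# 2) Discriminative feature space: classify sub-cluster membership.
clf = LogisticRegression(max_iter=1000).fit(X_normal, labels)

# 3) Anomaly score: low maximum class confidence means the point sits
#    far from every normal sub-cluster.
X_test = np.random.randn(10, 32) + 4.0   # shifted, likely anomalous
scores = 1.0 - clf.predict_proba(X_test).max(axis=1)
print(scores.round(3))
```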

68.Robust Deep Learning with Active Noise Cancellation for Spatial Computing ⬇️

This paper proposes CANC, a Co-teaching Active Noise Cancellation method, applied in spatial computing to address deep learning trained with extremely noisy labels. Deep learning algorithms have been successful in spatial computing for land or building footprint recognition; however, much noise exists in ground truth labels due to how labels are collected from satellite imagery in spatial computing. Existing methods that deal with extreme label noise conduct clean-sample selection and do not utilize the remaining samples, which can be wasteful given the cost of data retrieval. Our proposed CANC algorithm not only conserves high-cost training samples but also provides active label correction to further improve robust deep learning with extremely noisy labels. We demonstrate the effectiveness of CANC for building footprint recognition in spatial computing.
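
Co-teaching, which CANC builds on, has each of two networks pass its small-loss (likely clean) samples to train its peer. A single training step might be sketched as below; CANC's active label correction is not shown, and the networks and keep ratio are placeholders:

```python
import torch
import torch.nn.functional as F

def coteach_step(net_a, net_b, opt_a, opt_b, x, y, keep_ratio=0.7):
    """One co-teaching step: each net trains on the small-loss subset
    of the (noisy-label) minibatch selected by the other net."""
    n_keep = int(keep_ratio * len(y))
    with torch.no_grad():
        idx_a = F.cross_entropy(net_a(x), y, reduction="none").argsort()[:n_keep]
        idx_b = F.cross_entropy(net_b(x), y, reduction="none").argsort()[:n_keep]
    for net, opt, idx in [(net_a, opt_a, idx_b), (net_b, opt_b, idx_a)]:
        opt.zero_grad()
        F.cross_entropy(net(x[idx]), y[idx]).backward()
        opt.step()

net_a, net_b = torch.nn.Linear(16, 2), torch.nn.Linear(16, 2)
opt_a = torch.optim.SGD(net_a.parameters(), lr=0.1)
opt_b = torch.optim.SGD(net_b.parameters(), lr=0.1)
coteach_step(net_a, net_b, opt_a, opt_b,
             torch.randn(64, 16), torch.randint(0, 2, (64,)))
```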

69.Large-scale kernelized GRANGER causality to infer topology of directed graphs with applications to brain networks ⬇️

Graph topology inference for network processes with co-evolving and interacting time-series is crucial for network studies. Vector autoregressive (VAR) models are popular approaches for topology inference of directed graphs; however, in large networks with short time-series, topology estimation becomes ill-posed. This paper proposes a novel nonlinearity-preserving topology inference method for directed networks with co-evolving nodal processes that solves this ill-posedness problem. The proposed method, large-scale kernelized Granger causality (lsKGC), uses kernel functions to transform the data into a low-dimensional feature space, solves the autoregressive problem in that feature space, and then finds the pre-images in the input space to infer the topology. Extensive simulations on synthetic datasets with nonlinear and linear dependencies and known ground truth demonstrate significant improvement in the area under the receiver operating characteristic curve (AUC) for network recovery compared to existing methods. Furthermore, tests on real datasets from a functional magnetic resonance imaging (fMRI) study demonstrate 96.3 percent accuracy in diagnosing schizophrenia patients, the highest in the literature using only brain time-series information.
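
A rough illustration of kernelized Granger causality on two series: fit kernel ridge regressions of y's future on past values with and without x, and compare residual variances. This is a generic approximation, not lsKGC itself, whose kernel feature map and pre-image machinery are omitted:

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge

rng = np.random.default_rng(0)
T = 500
x = rng.standard_normal(T)
y = np.zeros(T)
for t in range(1, T):                      # x nonlinearly drives y
    y[t] = 0.5 * y[t - 1] + np.tanh(x[t - 1]) + 0.1 * rng.standard_normal()

past_y = y[:-1].reshape(-1, 1)             # restricted model: y's past only
past_xy = np.column_stack([y[:-1], x[:-1]])  # full model: add x's past
target = y[1:]

res_r = target - KernelRidge(kernel="rbf").fit(past_y, target).predict(past_y)
res_f = target - KernelRidge(kernel="rbf").fit(past_xy, target).predict(past_xy)

# Granger-style score: how much does adding x shrink the prediction error?
print(np.log(res_r.var() / res_f.var()))   # > 0 suggests x -> y
```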

70.Deep Learning -- A first Meta-Survey of selected Reviews across Scientific Disciplines and their Research Impact ⬇️

Deep learning belongs to the field of artificial intelligence, where machines perform tasks that typically require some kind of human intelligence. Deep learning tries to achieve this by mimicking the learning of a human brain. Similar to the basic structure of a brain, which consists of (billions of) neurons and connections between them, a deep learning algorithm consists of an artificial neural network that resembles this biological structure. Mimicking the learning process of humans and their senses, deep learning networks are fed with (sensory) data such as texts, images, videos, or sounds. These networks outperform state-of-the-art methods on various tasks and, as a result, the whole field has seen exponential growth in recent years, with well over 10,000 publications per year. For example, the search engine PubMed alone, which covers only a subset of all publications in the medical field, returns over 11,000 results for the search term 'deep learning' in Q3 2020, and around 90% of these results are from the last three years. Consequently, a complete overview of the field of deep learning is already impossible to obtain, and in the near future it will potentially become difficult to obtain an overview of even a subfield. However, there are several review articles about deep learning focused on specific scientific fields or applications, for example deep learning advances in computer vision or in specific tasks such as object detection. With these surveys as a foundation, the aim of this contribution is to provide a first high-level, categorized meta-analysis of selected reviews on deep learning across different scientific disciplines, and to outline the research impact they have already had in a short period of time.