ArXiv cs.CV --Wed, 11 Nov 2020

1.Learning to Communicate and Correct Pose Errors ⬇️

Learned communication makes multi-agent systems more effective by aggregating distributed information. However, it also exposes individual agents to the threat of erroneous messages they might receive. In this paper, we study the setting proposed in V2VNet, where nearby self-driving vehicles jointly perform object detection and motion forecasting in a cooperative manner. Despite a huge performance boost when the agents solve the task together, the gain is quickly diminished in the presence of pose noise since the communication relies on spatial transformations. Hence, we propose a novel neural reasoning framework that learns to communicate, to estimate potential errors, and finally, to reach a consensus about those errors. Experiments confirm that our proposed framework significantly improves the robustness of multi-agent self-driving perception and motion forecasting systems under realistic and severe localization noise.

2.Perception Improvement for Free: Exploring Imperceptible Black-box Adversarial Attacks on Image Classification ⬇️

Deep neural networks are vulnerable to adversarial attacks. White-box adversarial attacks can fool neural networks with small adversarial perturbations, especially for large images. However, keeping successful adversarial perturbations imperceptible is especially challenging for transfer-based black-box adversarial attacks. Often such adversarial examples can be easily spotted due to their unpleasantly poor visual qualities, which compromises the threat of adversarial attacks in practice. In this study, to improve the perceptual quality of black-box adversarial examples, we propose structure-aware adversarial attacks that generate adversarial images based on psychological perceptual models. Specifically, we allow higher perturbations in perceptually insignificant regions, while assigning lower or no perturbation to visually sensitive regions. In addition to the proposed spatially constrained adversarial perturbations, we also propose a novel structure-aware frequency adversarial attack method in the discrete cosine transform (DCT) domain. Since the proposed attacks are independent of gradient estimation, they can be directly incorporated into existing gradient-based attacks. Experimental results show that, at a comparable attack success rate (ASR), the proposed methods can produce adversarial examples with considerably improved visual quality for free. At comparable perceptual quality, the proposed approaches achieve higher attack success rates: in particular, for the frequency structure-aware attacks, the average ASR improves by more than 10% over the baseline attacks.
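
As a rough illustration of the spatial idea only (not the authors' perceptual model or their DCT attack), the sketch below uses local standard deviation as a stand-in for perceptual insignificance and scales a sign perturbation accordingly; the function and parameter names are made up for the example.

```python
import numpy as np

def local_std(img, k=8):
    """Blockwise standard deviation: a crude stand-in for a perceptual-significance map."""
    out = np.zeros_like(img, dtype=np.float64)
    h, w = img.shape
    for i in range(0, h, k):
        for j in range(0, w, k):
            out[i:i + k, j:j + k] = img[i:i + k, j:j + k].std()
    return out

def structure_aware_perturb(img, grad, eps=8.0):
    """Allow larger perturbations in textured regions and little or none in smooth ones."""
    weight = local_std(img)
    weight /= weight.max() + 1e-8                 # 0 (smooth) .. 1 (textured)
    return np.clip(img + eps * weight * np.sign(grad), 0, 255)

img = np.random.randint(0, 256, (64, 64)).astype(np.float64)   # toy grayscale image
adv = structure_aware_perturb(img, np.random.randn(64, 64))    # grad would come from an attack
```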

3.Classification of Polarimetric SAR Images Using Compact Convolutional Neural Networks ⬇️

Classification of polarimetric synthetic aperture radar (PolSAR) images is an active research area with a major role in environmental applications. The traditional Machine Learning (ML) methods proposed in this domain generally focus on utilizing highly discriminative features to improve the classification performance, but this task is complicated by the well-known "curse of dimensionality" phenomenon. Other approaches based on deep Convolutional Neural Networks (CNNs) have certain limitations and drawbacks, such as high computational complexity, an unfeasibly large training set with ground-truth labels, and special hardware requirements. In this work, to address the limitations of traditional ML and deep CNN based methods, a novel and systematic classification framework is proposed for the classification of PolSAR images, based on a compact and adaptive implementation of CNNs using a sliding-window classification approach. The proposed approach has three advantages. First, there is no requirement for an extensive feature extraction process. Second, it is computationally efficient due to the compact configurations utilized. In particular, the proposed compact and adaptive CNN model is designed to achieve the maximum classification accuracy with minimum training and computational complexity. This is of considerable importance considering the high cost of labelling in PolSAR classification. Finally, the proposed approach can perform classification using smaller window sizes than deep CNNs. Experimental evaluations have been performed over the four most commonly used benchmark PolSAR images: AIRSAR L-Band and RADARSAT-2 C-Band data of the San Francisco Bay and Flevoland areas. Accordingly, the best obtained overall accuracies range between 92.33% and 99.39% for these benchmark study sites.

4.Temporal Stochastic Softmax for 3D CNNs: An Application in Facial Expression Recognition ⬇️

Training deep learning models for accurate spatiotemporal recognition of facial expressions in videos requires significant computational resources. For practical reasons, 3D Convolutional Neural Networks (3D CNNs) are usually trained with relatively short clips randomly extracted from videos. However, such uniform sampling is generally sub-optimal because equal importance is assigned to each temporal clip. In this paper, we present a strategy for efficient video-based training of 3D CNNs. It relies on softmax temporal pooling and a weighted sampling mechanism to select the most relevant training clips. The proposed softmax strategy provides several advantages: reduced computational complexity due to efficient clip sampling, and improved accuracy since temporal weighting focuses on more relevant clips during both training and inference. Experimental results obtained with the proposed method on several facial expression recognition benchmarks show the benefits of focusing on more informative clips in training videos. In particular, our approach improves accuracy and reduces computational cost by mitigating the impact of inaccurate trimming, coarse annotation of videos, and the heterogeneous distribution of visual information across time.
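
A minimal sketch of the weighted clip sampling idea: per-clip relevance scores (however they are obtained in the actual method) are turned into a softmax distribution from which training clips are drawn; the temperature and score values here are illustrative.

```python
import numpy as np

def sample_clip(clip_scores, temperature=1.0, rng=None):
    """Draw one clip index with probability proportional to softmax(score / T)."""
    rng = rng or np.random.default_rng(0)
    z = np.asarray(clip_scores, dtype=np.float64) / temperature
    p = np.exp(z - z.max())
    p /= p.sum()
    return int(rng.choice(len(p), p=p)), p

idx, probs = sample_clip([0.2, 1.5, 0.1, 2.3])   # the last clip is sampled most often
print(idx, probs.round(3))
```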

5.Pixel precise unsupervised detection of viral particle proliferation in cellular imaging data ⬇️

Cellular and molecular imaging techniques and models have been developed to characterize single stages of viral proliferation after focal infection of cells in vitro. The fast and automatic classification of cell imaging data may prove helpful prior to any further comparison of representative experimental data to mathematical models of viral propagation in host cells. Here, we use computer-generated images drawn from a reproduction of an imaging model from a previously published study of experimentally obtained cell imaging data representing progressive viral particle proliferation in host cell monolayers. Inspired by experimental time-based imaging data, viral particle increase over time is simulated by a one-by-one increase, across images, of single black or gray pixels representing dead or partially infected cells, and hypothetical remission by a one-by-one increase in white pixels coding for living cells in the original image model. The image simulations are submitted to unsupervised learning by a Self-Organizing Map (SOM), and the Quantization Error in the SOM output (SOM-QE) is used for automatic classification of the image simulations as a function of the represented extent of viral particle proliferation or cell recovery. Unsupervised classification by SOM-QE of 160 model images, each with more than three million pixels, is shown to provide a statistically reliable, pixel-precise, and fast classification model that outperforms human computer-assisted image classification by RGB image mean computation. The automatic classification procedure proposed here provides a powerful approach to understanding finely tuned mechanisms in the infection and proliferation of viruses in cell lines in vitro or other cells.
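
A compact numpy sketch of the SOM quantization error (SOM-QE) used for the classification: train a small self-organizing map on the pixel data and report the mean distance of each sample to its best-matching unit. The grid size, learning rate and random pixel data below are placeholders, not the study's settings.

```python
import numpy as np

def train_som(data, grid=(4, 4), epochs=10, lr=0.5, sigma=1.0, seed=0):
    """Tiny Self-Organizing Map trainer; returns the (units, features) weight matrix."""
    rng = np.random.default_rng(seed)
    gx, gy = grid
    w = rng.random((gx * gy, data.shape[1]))
    coords = np.array([(i, j) for i in range(gx) for j in range(gy)], dtype=float)
    for _ in range(epochs):
        for x in rng.permutation(data):
            bmu = np.argmin(((w - x) ** 2).sum(axis=1))        # best-matching unit
            d2 = ((coords - coords[bmu]) ** 2).sum(axis=1)
            w += lr * np.exp(-d2 / (2 * sigma ** 2))[:, None] * (x - w)
    return w

def som_qe(data, w):
    """Quantization error: mean distance from each sample to its best-matching unit."""
    d = np.sqrt(((data[:, None, :] - w[None, :, :]) ** 2).sum(axis=-1))
    return d.min(axis=1).mean()

pixels = np.random.random((500, 3))        # stand-in for flattened RGB pixel values
print(som_qe(pixels, train_som(pixels)))
```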

6.A Multi-Plant Disease Diagnosis Method using Convolutional Neural Network ⬇️

A disease that prevents a plant from reaching its maximal capacity is defined as a plant disease. From the perspective of agriculture, diagnosing plant disease is crucial, as diseases often limit plants' production capacity. However, manual approaches to recognizing plant diseases are challenging and time-consuming. Therefore, computerized recognition of plant diseases is highly desired in the field of agricultural automation. Owing to recent improvements in computer vision, identifying diseases using leaf images of a particular plant has already been introduced. Nevertheless, most existing models can only diagnose diseases of a specific plant. Hence, in this chapter, we investigate an optimal plant disease identification model that combines the diagnosis of multiple plants. Although it relies on multi-class classification, the model adopts a multilabel classification strategy to identify the plant and the type of disease in parallel. For the experiment and evaluation, we collected data from various online sources covering leaf images of six plants: tomato, potato, rice, corn, grape, and apple. In our investigation, we implement numerous popular convolutional neural network (CNN) architectures. The experimental results validate that the Xception and DenseNet architectures perform better in multi-label plant disease classification tasks. Through architectural investigation, we infer that skip connections, spatial convolutions, and shorter hidden-layer connectivity lead to better results in plant disease classification.
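
The parallel plant/disease identification can be pictured as a shared backbone with two classification heads; the PyTorch layout below is a hypothetical sketch (the chapter itself evaluates architectures such as Xception and DenseNet), with made-up class counts and layer sizes.

```python
import torch
import torch.nn as nn

class MultiPlantDiseaseNet(nn.Module):
    """Shared feature extractor with two heads predicting the plant and the disease in parallel."""
    def __init__(self, n_plants=6, n_diseases=20):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.plant_head = nn.Linear(64, n_plants)
        self.disease_head = nn.Linear(64, n_diseases)

    def forward(self, x):
        f = self.backbone(x)
        return self.plant_head(f), self.disease_head(f)

plant_logits, disease_logits = MultiPlantDiseaseNet()(torch.randn(2, 3, 224, 224))
```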

7.Multi-modal, multi-task, multi-attention (M3) deep learning detection of reticular pseudodrusen: towards automated and accessible classification of age-related macular degeneration ⬇️

Objective Reticular pseudodrusen (RPD), a key feature of age-related macular degeneration (AMD), are poorly detected by human experts on standard color fundus photography (CFP) and typically require advanced imaging modalities such as fundus autofluorescence (FAF). The objective was to develop and evaluate the performance of a novel 'M3' deep learning framework on RPD detection. Materials and Methods A deep learning framework M3 was developed to detect RPD presence accurately using CFP alone, FAF alone, or both, employing >8000 CFP-FAF image pairs obtained prospectively (Age-Related Eye Disease Study 2). The M3 framework includes multi-modal (detection from single or multiple image modalities), multi-task (training different tasks simultaneously to improve generalizability), and multi-attention (improving ensembled feature representation) operation. Performance on RPD detection was compared with state-of-the-art deep learning models and 13 ophthalmologists; performance on detection of two other AMD features (geographic atrophy and pigmentary abnormalities) was also evaluated. Results For RPD detection, M3 achieved area under receiver operating characteristic (AUROC) 0.832, 0.931, and 0.933 for CFP alone, FAF alone, and both, respectively. M3 performance on CFP was very substantially superior to human retinal specialists (median F1-score 0.644 versus 0.350). External validation (on Rotterdam Study, Netherlands) demonstrated high accuracy on CFP alone (AUROC 0.965). The M3 framework also accurately detected geographic atrophy and pigmentary abnormalities (AUROC 0.909 and 0.912, respectively), demonstrating its generalizability. Conclusion This study demonstrates the successful development, robust evaluation, and external validation of a novel deep learning framework that enables accessible, accurate, and automated AMD diagnosis and prognosis.

8.Multi-pooled Inception features for no-reference image quality assessment ⬇️

Image quality assessment (IQA) is an important element of a broad spectrum of applications ranging from automatic video streaming to display technology. Furthermore, the measurement of image quality requires a balanced investigation of image content and features. Our proposed approach extracts visual features by attaching global average pooling (GAP) layers to multiple Inception modules of a convolutional neural network (CNN) pretrained on the ImageNet database. In contrast to previous methods, we do not take patches from the input image. Instead, the input image is treated as a whole and is run through a pretrained CNN body to extract resolution-independent, multi-level deep features. As a consequence, our method can be easily generalized to any input image size and pretrained CNNs. Thus, we present a detailed parameter study with respect to the CNN base architectures and the effectiveness of different deep features. We demonstrate that our best proposal - called MultiGAP-NRIQA - is able to provide state-of-the-art results on three benchmark IQA databases. Furthermore, these results were also confirmed in a cross-database test using the LIVE In the Wild Image Quality Challenge database.
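
The multi-level GAP feature extraction can be sketched with forward hooks on a pretrained Inception backbone; the module names below are torchvision's Inception v3 attributes and the chosen tap points are an assumption, not necessarily the paper's configuration.

```python
import torch
import torch.nn.functional as F
import torchvision

model = torchvision.models.inception_v3(weights=None).eval()   # ImageNet weights in practice
taps, feats = ["Mixed_5d", "Mixed_6e", "Mixed_7c"], {}

def make_hook(name):
    def hook(_module, _inp, out):
        feats[name] = F.adaptive_avg_pool2d(out, 1).flatten(1)  # GAP -> (batch, channels)
    return hook

for name in taps:
    getattr(model, name).register_forward_hook(make_hook(name))

with torch.no_grad():
    model(torch.randn(1, 3, 299, 299))                          # Inception v3 expects 299x299
feature_vector = torch.cat([feats[n] for n in taps], dim=1)     # fed to the quality regressor
print(feature_vector.shape)
```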

9.On-Device Language Identification of Text in Images using Diacritic Characters ⬇️

Diacritic characters can be considered a unique set of characters providing us with adequate and significant clues for identifying a given language with considerably high accuracy. Diacritics, though associated with phonetics, often serve as a distinguishing feature for many languages, especially those with a Latin script. In this proposed work, we aim to identify the language of text in images using the presence of diacritic characters in order to improve Optical Character Recognition (OCR) performance in any given automated environment. We showcase our work across 13 Latin languages encompassing 85 diacritic characters. We use an architecture similar to SqueezeDet for object detection of diacritic characters, followed by a shallow network to finally identify the language. OCR systems, when accompanied by an identified language parameter, tend to produce better results than standalone deployment of OCR systems. The discussed work, apart from guaranteeing an improvement in OCR results, also takes on-device (mobile phone) constraints into consideration in terms of model size and inference time.

10.MP-ResNet: Multi-path Residual Network for the Semantic segmentation of High-Resolution PolSAR Images ⬇️

There are limited studies on the semantic segmentation of high-resolution Polarimetric Synthetic Aperture Radar (PolSAR) images due to the scarcity of training data and the interference of speckle noise. The Gaofen contest has provided open access to a high-quality PolSAR semantic segmentation dataset. Taking this opportunity, we propose a Multi-path ResNet (MP-ResNet) architecture for the semantic segmentation of high-resolution PolSAR images. Compared to conventional U-shape encoder-decoder convolutional neural network (CNN) architectures, the MP-ResNet learns semantic context with its parallel multi-scale branches, which greatly enlarges its valid receptive fields and improves the embedding of local discriminative features. In addition, MP-ResNet adopts a multi-level feature fusion design in its decoder to make the best use of the features learned from its different branches. Ablation studies show that the MP-ResNet has significant advantages over its baseline method (FCN with ResNet34). It also surpasses several classic state-of-the-art methods in terms of overall accuracy (OA), mean F1 and fwIoU, while its computational cost increases only slightly. This CNN architecture can be used as a baseline method for future studies on the semantic segmentation of PolSAR images. The code is available at: this https URL.

11.Decoupled Appearance and Motion Learning for Efficient Anomaly Detection in Surveillance Video ⬇️

Automating the analysis of surveillance video footage is of great interest when urban environments or industrial sites are monitored by a large number of cameras. As anomalies are often context-specific, it is hard to predefine events of interest and collect labelled training data. A purely unsupervised approach for automated anomaly detection is much more suitable. For every camera, a separate algorithm could then be deployed that learns over time a baseline model of appearance and motion related features of the objects within the camera viewport. Anything that deviates from this baseline is flagged as an anomaly for further analysis downstream. We propose a new neural network architecture that learns the normal behavior in a purely unsupervised fashion. In contrast to previous work, we use latent code predictions as our anomaly metric. We show that this outperforms reconstruction-based and frame prediction-based methods on different benchmark datasets both in terms of accuracy and robustness against changing lighting and weather conditions. By decoupling an appearance and a motion model, our model can also process 16 to 45 times more frames per second than related approaches which makes our model suitable for deploying on the camera itself or on other edge devices.
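
The anomaly metric can be illustrated as the error between a predicted latent code and the actually encoded code of the next frame; the tiny encoder and predictor below are placeholders, not the paper's decoupled appearance and motion models.

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Conv2d(3, 16, 3, 2, 1), nn.ReLU(),
                        nn.Conv2d(16, 32, 3, 2, 1), nn.ReLU(),
                        nn.AdaptiveAvgPool2d(1), nn.Flatten())        # frame -> 32-d latent code
predictor = nn.GRU(input_size=32, hidden_size=32, batch_first=True)   # past codes -> next code

def anomaly_score(frames):                        # frames: (T, 3, H, W)
    codes = encoder(frames)                       # (T, 32)
    pred, _ = predictor(codes[:-1].unsqueeze(0))  # predict codes 1..T-1 from codes 0..T-2
    return torch.norm(pred[0, -1] - codes[-1])    # large prediction error => anomalous frame

print(anomaly_score(torch.randn(8, 3, 64, 64)))
```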

12.Human-centric Spatio-Temporal Video Grounding With Visual Transformers ⬇️

In this work, we introduce a novel task - Human-centric Spatio-Temporal Video Grounding (HC-STVG). Unlike the existing referring expression tasks in images or videos, by focusing on humans, HC-STVG aims to localize a spatiotemporal tube of the target person from an untrimmed video based on a given textual description. This task is useful, especially for healthcare and security-related applications, where the surveillance videos can be extremely long but only a specific person during a specific period of time is of interest. HC-STVG is a video grounding task that requires both spatial (where) and temporal (when) localization. Unfortunately, the existing grounding methods cannot handle this task well. We tackle this task by proposing an effective baseline method named Spatio-Temporal Grounding with Visual Transformers (STGVT), which utilizes Visual Transformers to extract cross-modal representations for video-sentence matching and temporal localization. To facilitate this task, we also contribute an HC-STVG dataset consisting of 5,660 video-sentence pairs on complex multi-person scenes. Specifically, each video lasts for 20 seconds, paired with a natural query sentence containing an average of 17.25 words. Extensive experiments are conducted on this dataset, demonstrating that the newly proposed method outperforms the existing baseline methods.

13.Point Cloud Registration Based on Consistency Evaluation of Rigid Transformation in Parameter Space ⬇️

We can use a method called registration to integrate point clouds that represent the shape of the real world. In this paper, we propose a highly accurate and stable registration method. Our method detects keypoints from point clouds and generates triplets using multiple descriptors. Furthermore, our method evaluates the consistency of the rigid transformation parameters of each triplet with histograms and obtains the rigid transformation between the point clouds. In our experiments, our method had minimal errors and no major failures. As a result, we obtained sufficiently accurate and stable registration results compared to the competing methods.
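
The parameter-space consistency check can be illustrated by histogram voting over the transformation parameters produced by all triplets; the sketch below handles only the translation part and uses synthetic hypotheses, so it is a simplification of the actual method.

```python
import numpy as np

def consensus_translation(translations, bins=20):
    """Take, per axis, the centre of the most populated histogram bin as the consensus value."""
    est = []
    for axis in range(3):
        hist, edges = np.histogram(translations[:, axis], bins=bins)
        k = int(hist.argmax())
        est.append(0.5 * (edges[k] + edges[k + 1]))
    return np.array(est)

hyps = np.vstack([np.random.normal([1.0, 0.2, -0.5], 0.02, (80, 3)),   # consistent triplets
                  np.random.uniform(-2.0, 2.0, (20, 3))])              # outlier triplets
print(consensus_translation(hyps))   # close to (1.0, 0.2, -0.5)
```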

14.Residual Pose: A Decoupled Approach for Depth-based 3D Human Pose Estimation ⬇️

We propose to leverage recent advances in reliable 2D pose estimation with Convolutional Neural Networks (CNN) to estimate the 3D pose of people from depth images in multi-person Human-Robot Interaction (HRI) scenarios. Our method is based on the observation that using the depth information to obtain 3D lifted points from 2D body landmark detections provides a rough estimate of the true 3D human pose, thus requiring only a refinement step. In that line our contributions are threefold. (i) we propose to perform 3D pose estimation from depth images by decoupling 2D pose estimation and 3D pose refinement; (ii) we propose a deep-learning approach that regresses the residual pose between the lifted 3D pose and the true 3D pose; (iii) we show that despite its simplicity, our approach achieves very competitive results both in accuracy and speed on two public datasets and is therefore appealing for multi-person HRI compared to recent state-of-the-art methods.
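
A minimal sketch of the decoupling: 2D landmark detections are lifted to a rough 3D pose using their depth values and camera intrinsics, and a small network regresses the residual to the true pose. The intrinsics, joint count and network shape below are placeholders.

```python
import torch
import torch.nn as nn

def lift_to_3d(kpts_2d, depth, fx=500.0, fy=500.0, cx=320.0, cy=240.0):
    """Back-project 2D keypoints (J, 2) with per-joint depth (J,) to camera-frame 3D points."""
    x = (kpts_2d[:, 0] - cx) * depth / fx
    y = (kpts_2d[:, 1] - cy) * depth / fy
    return torch.stack([x, y, depth], dim=1)                     # (J, 3) rough pose

residual_net = nn.Sequential(nn.Linear(17 * 3, 256), nn.ReLU(),
                             nn.Linear(256, 17 * 3))             # regresses the residual pose

rough = lift_to_3d(torch.rand(17, 2) * 640, torch.rand(17) * 3 + 1)
refined = rough + residual_net(rough.flatten()).view(17, 3)      # rough lift + learned residual
```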

15.Deep Multimodal Fusion by Channel Exchanging ⬇️

Deep multimodal fusion by using multiple sources of data for classification or regression has exhibited a clear advantage over the unimodal counterpart on various applications. Yet, current methods including aggregation-based and alignment-based fusion are still inadequate in balancing the trade-off between inter-modal fusion and intra-modal processing, incurring a bottleneck of performance improvement. To this end, this paper proposes Channel-Exchanging-Network (CEN), a parameter-free multimodal fusion framework that dynamically exchanges channels between sub-networks of different modalities. Specifically, the channel exchanging process is self-guided by individual channel importance that is measured by the magnitude of Batch-Normalization (BN) scaling factor during training. The validity of such exchanging process is also guaranteed by sharing convolutional filters yet keeping separate BN layers across modalities, which, as an add-on benefit, allows our multimodal architecture to be almost as compact as a unimodal network. Extensive experiments on semantic segmentation via RGB-D data and image translation through multi-domain input verify the effectiveness of our CEN compared to current state-of-the-art methods. Detailed ablation studies have also been carried out, which provably affirm the advantage of each component we propose. Our code is available at this https URL.
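
The exchange rule itself is compact: channels whose BN scaling factor is close to zero are deemed unimportant for their own modality and are replaced by the corresponding channels from the other modality. The threshold and tensors below are illustrative, not CEN's exact implementation.

```python
import torch

def exchange_channels(x_a, x_b, gamma_a, gamma_b, thresh=1e-2):
    """Swap in the other modality's features on channels whose own BN |gamma| is small."""
    swap_a = gamma_a.abs() < thresh       # unimportant channels of modality A
    swap_b = gamma_b.abs() < thresh       # unimportant channels of modality B
    out_a, out_b = x_a.clone(), x_b.clone()
    out_a[:, swap_a] = x_b[:, swap_a]
    out_b[:, swap_b] = x_a[:, swap_b]
    return out_a, out_b

x_rgb, x_depth = torch.randn(2, 64, 32, 32), torch.randn(2, 64, 32, 32)
g_rgb, g_depth = torch.rand(64) * 0.1, torch.rand(64) * 0.1      # BN scaling factors per branch
y_rgb, y_depth = exchange_channels(x_rgb, x_depth, g_rgb, g_depth)
```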

16.Joint Super-Resolution and Rectification for Solar Cell Inspection ⬇️

Visual inspection of solar modules is an important monitoring facility in photovoltaic power plants. Since a single measurement of fast CMOS sensors is limited in spatial resolution and often not sufficient to reliably detect small defects, we apply multi-frame super-resolution (MFSR) to a sequence of low resolution measurements. In addition, the rectification and removal of lens distortion simplifies subsequent analysis. Therefore, we propose to fuse this pre-processing with standard MFSR algorithms. This is advantageous, because we omit a separate processing step, the motion estimation becomes more stable and the spacing of high-resolution (HR) pixels on the rectified module image becomes uniform w.r.t. the module plane, regardless of perspective distortion. We present a comprehensive user study showing that MFSR is beneficial for defect recognition by human experts and that the proposed method performs better than the state of the art. Furthermore, we apply automated crack segmentation and show that the proposed method performs 3x better than bicubic upsampling and 2x better than the state of the art for automated inspection.

17.Removing Brightness Bias in Rectified Gradients ⬇️

Interpretation and improvement of deep neural networks rely on a better understanding of their underlying mechanisms. In particular, gradients of classes or concepts with respect to the input features (e.g., pixels in images) are often used as importance scores, which are visualized in saliency maps. Thus, a family of saliency methods provides an intuitive way to identify input features with substantial influence on classifications or latent concepts. Rectified Gradients \cite{Kim2019} is a new method which introduces layer-wise thresholding in order to denoise the saliency maps. While visually coherent in certain cases, we identify a brightness bias in Rectified Gradients. We demonstrate that dark areas of an input image are not highlighted by a saliency map using Rectified Gradients, even if they are relevant for the class or concept. Even in scaled images, the bias exists around an artificial point in the color spectrum. Our simple modification removes this bias and recovers input features that were removed due to their colors.
"No Bias Rectified Gradient" is available at \url{this https URL}

18.AIM 2020 Challenge on Learned Image Signal Processing Pipeline ⬇️

This paper reviews the second AIM learned ISP challenge and provides the description of the proposed solutions and results. The participating teams were solving a real-world RAW-to-RGB mapping problem, where the goal was to map the original low-quality RAW images captured by the Huawei P20 device to the same photos obtained with the Canon 5D DSLR camera. The considered task embraced a number of complex computer vision subtasks, such as image demosaicing, denoising, white balancing, color and contrast correction, demoireing, etc. The target metric used in this challenge combined fidelity scores (PSNR and SSIM) with solutions' perceptual results measured in a user study. The proposed solutions significantly improved the baseline results, defining the state-of-the-art for practical image signal processing pipeline modeling.

19.SelfDeco: Self-Supervised Monocular Depth Completion in Challenging Indoor Environments ⬇️

We present a novel algorithm for self-supervised monocular depth completion. Our approach is based on training a neural network that requires only sparse depth measurements and corresponding monocular video sequences, without dense depth labels. Our self-supervised algorithm is designed for challenging indoor environments with textureless regions, glossy and transparent surfaces, non-Lambertian surfaces, moving people, long and diverse depth ranges, and scenes captured with complex ego-motions. Our novel architecture leverages both deep stacks of sparse convolution blocks to extract sparse depth features and pixel-adaptive convolutions to fuse image and depth features. We compare with existing approaches on the NYUv2, KITTI and NAVERLABS indoor datasets, and observe 5-34% reductions in root-mean-square error (RMSE).

20.Conceptual Compression via Deep Structure and Texture Synthesis ⬇️

Existing compression methods typically focus on the removal of signal-level redundancies, while the potential and versatility of decomposing visual data into compact conceptual components still lack further study. To this end, we propose a novel conceptual compression framework that encodes visual data into compact structure and texture representations, then decodes in a deep synthesis fashion, aiming to achieve better visual reconstruction quality, flexible content manipulation, and potential support for various vision tasks. In particular, we propose to compress images by a dual-layered model consisting of two complementary visual features: 1) structure layer represented by structural maps and 2) texture layer characterized by low-dimensional deep representations. At the encoder side, the structural maps and texture representations are individually extracted and compressed, generating the compact, interpretable, inter-operable bitstreams. During the decoding stage, a hierarchical fusion GAN (HF-GAN) is proposed to learn the synthesis paradigm where the textures are rendered into the decoded structural maps, leading to high-quality reconstruction with remarkable visual realism. Extensive experiments on diverse images have demonstrated the superiority of our framework with lower bitrates, higher reconstruction quality, and increased versatility towards visual analysis and content manipulation tasks.

21.Detecting Human-Object Interaction with Mixed Supervision ⬇️

Human object interaction (HOI) detection is an important task in image understanding and reasoning. It is in the form of an HOI triplet ⟨human, verb, object⟩, requiring bounding boxes for the human and the object, and the action between them, for task completion. In other words, this task requires strong supervision for training, which is however hard to procure. A natural solution to overcome this is to pursue weakly-supervised learning, where we only know the presence of certain HOI triplets in images but their exact location is unknown. Most weakly-supervised learning methods do not make provision for leveraging data with strong supervision when such data are available; and indeed a naive combination of these two paradigms in HOI detection fails to make contributions to each other. In this regard, we propose a mixed-supervised HOI detection pipeline: thanks to a specific design of momentum-independent learning, it learns seamlessly across these two types of supervision. Moreover, in light of the annotation insufficiency in mixed supervision, we introduce an HOI element swapping technique to synthesize diverse and hard negatives across images and improve the robustness of the model. Our method is evaluated on the challenging HICO-DET dataset. It performs close to or even better than many fully-supervised methods by using a mixed amount of strong and weak annotations; furthermore, it outperforms representative state-of-the-art weakly- and fully-supervised methods under the same supervision.

22.Unsupervised Contrastive Photo-to-Caricature Translation based on Auto-distortion ⬇️

Photo-to-caricature translation aims to synthesize the caricature as a rendered image exaggerating the features through sketching, pencil strokes, or other artistic drawings. Style rendering and geometry deformation are the most important aspects of the photo-to-caricature translation task. To take both into consideration, we propose an unsupervised contrastive photo-to-caricature translation architecture. Considering the intuitive artifacts in the existing methods, we propose a contrastive style loss for style rendering to enforce the similarity between the style of the rendered photo and the caricature, and simultaneously enhance its discrepancy to the photos. To obtain an exaggerating deformation in an unpaired/unsupervised fashion, we propose a Distortion Prediction Module (DPM) to predict a set of displacement vectors for each input image while fixing some controlling points, followed by thin-plate-spline interpolation for warping. The model is trained on unpaired photos and caricatures, and can synthesize bidirectionally given either a photo or a caricature as input. Extensive experiments demonstrate that the proposed model is effective at generating hand-drawn-like caricatures compared with existing competitors.

23.Multi-modal Fusion for Single-Stage Continuous Gesture Recognition ⬇️

Gesture recognition is a much studied research area which has myriad real-world applications including robotics and human-machine interaction. Current gesture recognition methods have heavily focused on isolated gestures, and existing continuous gesture recognition methods are limited by a two-stage approach where independent models are required for detection and classification, with the performance of the latter being constrained by detection performance. In contrast, we introduce a single-stage continuous gesture recognition model, that can detect and classify multiple gestures in a single video via a single model. This approach learns the natural transitions between gestures and non-gestures without the need for a pre-processing segmentation stage to detect individual gestures. To enable this, we introduce a multi-modal fusion mechanism to support the integration of important information that flows from multi-modal inputs, and is scalable to any number of modes. Additionally, we propose Unimodal Feature Mapping (UFM) and Multi-modal Feature Mapping (MFM) models to map uni-modal features and the fused multi-modal features respectively. To further enhance the performance we propose a mid-point based loss function that encourages smooth alignment between the ground truth and the prediction. We demonstrate the utility of our proposed framework which can handle variable-length input videos, and outperforms the state-of-the-art on two challenging datasets, EgoGesture, and IPN hand. Furthermore, ablative experiments show the importance of different components of the proposed framework.

24.Simple means Faster: Real-Time Human Motion Forecasting in Monocular First Person Videos on CPU ⬇️

We present a simple, fast, and light-weight RNN based framework for forecasting future locations of humans in first person monocular videos. The primary motivation for this work was to design a network which could accurately predict future trajectories at a very high rate on a CPU. Typical applications of such a system would be a social robot or a visual assistance system for all, as neither can afford high compute power without becoming heavier, less power efficient, and costlier. In contrast to many previous methods which rely on multiple types of cues such as camera ego-motion or 2D pose of the human, we show that a carefully designed network model which relies solely on bounding boxes can not only perform better but also predict trajectories at a very high rate while having a small size of approximately 17 MB. Specifically, we demonstrate that having an auto-encoder in the encoding phase of the past information and a regularizing layer at the end boosts the accuracy of predictions with negligible overhead. We experiment with three first person video datasets: CityWalks, FPL and JAAD. Our simple method trained on CityWalks surpasses the prediction accuracy of the state-of-the-art method (STED) while being 9.6x faster on a CPU (STED runs on a GPU). We also demonstrate that our model can transfer zero-shot or after just 15% fine-tuning to other similar datasets and perform on par with the state-of-the-art methods on such datasets (FPL and DTP). To the best of our knowledge, we are the first to accurately forecast trajectories at a very high prediction rate of 78 trajectories per second on a CPU.
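
The bounding-box-only forecasting can be pictured as a small encoder-decoder RNN over box coordinates; this is an illustrative layout (the paper additionally uses an auto-encoder on the past encoding and a final regularizing layer), with the hidden size and horizon chosen arbitrarily.

```python
import torch
import torch.nn as nn

class BoxForecaster(nn.Module):
    """Forecast future bounding boxes from past ones only (no ego-motion, no 2D pose)."""
    def __init__(self, hidden=64, horizon=10):
        super().__init__()
        self.horizon = horizon
        self.encoder = nn.GRU(4, hidden, batch_first=True)
        self.decoder = nn.GRU(4, hidden, batch_first=True)
        self.out = nn.Linear(hidden, 4)

    def forward(self, past):                       # past: (B, T, 4) = cx, cy, w, h
        _, h = self.encoder(past)
        box, preds = past[:, -1:], []
        for _ in range(self.horizon):
            o, h = self.decoder(box, h)
            box = self.out(o)                      # next predicted box
            preds.append(box)
        return torch.cat(preds, dim=1)             # (B, horizon, 4)

future = BoxForecaster()(torch.randn(2, 8, 4))
```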

25.Stage-wise Channel Pruning for Model Compression ⬇️

Auto-ML pruning methods aim at automatically searching for a pruning strategy to reduce the computational complexity of deep Convolutional Neural Networks (deep CNNs). However, some previous works found that the results of many Auto-ML pruning methods cannot even surpass the results of the uniform pruning method. In this paper, we first analyze the reason for the ineffectiveness of Auto-ML pruning. Subsequently, a stage-wise pruning (SP) method is proposed to solve the above problem. As with most of the previous Auto-ML pruning methods, SP also trains a super-net that can provide proxy performance for sub-nets and searches for the sub-net with the best proxy performance. Different from previous works, we split a deep CNN into several stages and use a full-net in which no layers are pruned to supervise the training and the searching of sub-nets. Remarkably, the proxy performance of sub-nets trained with SP is closer to the actual performance than in most of the previous Auto-ML pruning works. Therefore, SP achieves state-of-the-art results on both CIFAR-10 and ImageNet under the mobile setting.

26.CoADNet: Collaborative Aggregation-and-Distribution Networks for Co-Salient Object Detection ⬇️

Co-Salient Object Detection (CoSOD) aims at discovering salient objects that repeatedly appear in a given query group containing two or more relevant images. One challenging issue is how to effectively capture co-saliency cues by modeling and exploiting inter-image relationships. In this paper, we present an end-to-end collaborative aggregation-and-distribution network (CoADNet) to capture both salient and repetitive visual patterns from multiple images. First, we integrate saliency priors into the backbone features to suppress the redundant background information through an online intra-saliency guidance structure. After that, we design a two-stage aggregate-and-distribute architecture to explore group-wise semantic interactions and produce the co-saliency features. In the first stage, we propose a group-attentional semantic aggregation module that models inter-image relationships to generate the group-wise semantic representations. In the second stage, we propose a gated group distribution module that adaptively distributes the learned group semantics to different individuals in a dynamic gating mechanism. Finally, we develop a group consistency preserving decoder tailored for the CoSOD task, which maintains group constraints during feature decoding to predict more consistent full-resolution co-saliency maps. The proposed CoADNet is evaluated on four prevailing CoSOD benchmark datasets, which demonstrates the remarkable performance improvement over ten state-of-the-art competitors.

27.A low latency ASR-free end to end spoken language understanding system ⬇️

In recent years, developing a speech understanding system that classifies a waveform into structured data, such as intents and slots, without first transcribing the speech to text has emerged as an interesting research problem. This work proposes such a system, with the additional constraint that it has a small enough footprint to run on small micro-controllers and embedded systems with minimal latency. Given a streaming input speech signal, the proposed system can process it segment-by-segment without the need to have the entire stream at the moment of processing. The proposed system is evaluated on the publicly available Fluent Speech Commands dataset. Experiments show that the proposed system yields state-of-the-art performance with the advantage of low latency and a much smaller model when compared to other published works on the same task.

28.STCNet: Spatio-Temporal Cross Network for Industrial Smoke Detection ⬇️

Industrial smoke emissions present a serious threat to natural ecosystems and human health. Prior works have shown that using computer vision techniques to identify smoke is a low cost and convenient method. However, industrial smoke detection is a challenging task because industrial emission particles often decay rapidly outside the stacks or facilities, and steam is very similar to smoke. To overcome these problems, a novel Spatio-Temporal Cross Network (STCNet) is proposed to recognize industrial smoke emissions. The proposed STCNet involves a spatial pathway to extract texture features and a temporal pathway to capture smoke motion information. We assume that the spatial and temporal pathways can guide each other: the spatial path can easily recognize obvious interference such as trees and buildings, and the temporal path can highlight the obscure traces of smoke movement, so mutual guidance should benefit smoke detection performance. In addition, we design an efficient and concise spatio-temporal dual pyramid architecture to ensure better fusion of multi-scale spatiotemporal information. Finally, extensive experiments on a public dataset show that our STCNet outperforms the best competitors on the challenging RISE industrial smoke detection dataset by 6.2%. The code will be available at: this https URL.

29.On Efficient and Robust Metrics for RANSAC Hypotheses and 3D Rigid Registration ⬇️

This paper focuses on developing efficient and robust evaluation metrics for RANSAC hypotheses to achieve accurate 3D rigid registration. Estimating six-degree-of-freedom (6-DoF) pose from feature correspondences remains a popular approach to 3D rigid registration, where random sample consensus (RANSAC) is a de-facto choice to this problem. However, existing metrics for RANSAC hypotheses are either time-consuming or sensitive to common nuisances, parameter variations, and different application scenarios, resulting in performance deterioration in overall registration accuracy and speed. We alleviate this problem by first analyzing the contributions of inliers and outliers, and then proposing several efficient and robust metrics with different designing motivations for RANSAC hypotheses. Comparative experiments on four standard datasets with different nuisances and application scenarios verify that the proposed metrics can significantly improve the registration performance and are more robust than several state-of-the-art competitors, making them good gifts to practical applications. This work also draws an interesting conclusion, i.e., not all inliers are equal while all outliers should be equal, which may shed new light on this research problem.
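
In the spirit of the paper's conclusion ("not all inliers are equal while all outliers should be equal"), a hypothesis metric can weight inliers by their residual while capping every outlier at the same penalty; the truncated scoring below is a generic sketch, not one of the paper's proposed metrics.

```python
import numpy as np

def hypothesis_score(src, dst, R, t, tau=0.05):
    """Score a rigid-transform hypothesis on correspondences src -> dst (lower is better).
    Residuals are capped at the inlier threshold tau, so every outlier costs the same."""
    residuals = np.linalg.norm(src @ R.T + t - dst, axis=1)
    return float(np.minimum(residuals, tau).sum())

src = np.random.rand(100, 3)
R, t = np.eye(3), np.array([0.1, 0.0, 0.0])
dst = src @ R.T + t + 0.001 * np.random.randn(100, 3)    # near-perfect correspondences
print(hypothesis_score(src, dst, R, t))
```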

30.Understanding the hand-gestures using Convolutional Neural Networks and Generative Adversarial Networks ⬇️

In this paper, we introduce a hand gesture recognition system that recognizes characters in real time. The system consists of three modules: real-time hand tracking, gesture training, and gesture recognition using Convolutional Neural Networks. The CAMShift algorithm and hand-blob analysis are used for hand tracking to obtain motion descriptors and the hand region. The system is fairly robust to background clutter and uses skin color for hand gesture tracking and recognition. Furthermore, techniques are proposed to improve recognition performance and accuracy, such as the selection of training images and an adaptive gesture threshold that removes non-gesture patterns and helps qualify an input pattern as a gesture. In the experiments, the system has been tested on a vocabulary of 36 gestures, including alphabets and digits, and the results demonstrate the effectiveness of the approach.

31.Social-STAGE: Spatio-Temporal Multi-Modal Future Trajectory Forecast ⬇️

This paper considers the problem of multi-modal future trajectory forecast with ranking. Here, multi-modality and ranking refer to the multiple plausible path predictions and the confidence in those predictions, respectively. We propose Social-STAGE, Social interaction-aware Spatio-Temporal multi-Attention Graph convolution network with novel Evaluation for multi-modality. Our main contributions include analysis and formulation of multi-modality with ranking using interaction and multi-attention, and introduction of new metrics to evaluate the diversity and associated confidence of multi-modal predictions. We evaluate our approach on existing public datasets ETH and UCY and show that the proposed algorithm outperforms the state of the arts on these datasets.

32.Ellipse Detection and Localization with Applications to Knots in Sawn Lumber Images ⬇️

While general object detection has seen tremendous progress, localization of elliptical objects has received little attention in the literature. Our motivating application is the detection of knots in sawn timber images, which is an important problem since the number and types of knots are visual characteristics that adversely affect the quality of sawn timber. We demonstrate how models can be tailored to the elliptical shape and thereby improve on general purpose detectors; more generally, elliptical defects are common in industrial production, such as enclosed air bubbles when casting glass or plastic. In this paper, we adapt the Faster R-CNN with its Region Proposal Network (RPN) to model elliptical objects with a Gaussian function, and extend the existing Gaussian Proposal Network (GPN) architecture by adding the region-of-interest pooling and regression branches, as well as using the Wasserstein distance as the loss function to predict the precise locations of elliptical objects. Our proposed method has promising results on the lumber knot dataset: knots are detected with an average intersection over union of 73.05%, compared to 63.63% for general purpose detectors. Specific to the lumber application, we also propose an algorithm to correct any misalignment in the raw timber images during scanning, and contribute the first open-source lumber knot dataset by labeling the elliptical knots in the preprocessed images.
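
Representing each ellipse as a 2D Gaussian (mean = centre, covariance = axes and orientation) makes the Wasserstein regression loss concrete; the closed-form squared 2-Wasserstein distance between Gaussians is sketched below with made-up ellipse values.

```python
import numpy as np
from scipy.linalg import sqrtm

def gaussian_w2(mu1, cov1, mu2, cov2):
    """Squared 2-Wasserstein distance between two Gaussians N(mu1, cov1) and N(mu2, cov2)."""
    s2 = sqrtm(cov2)
    cross = sqrtm(s2 @ cov1 @ s2).real
    return float(np.sum((mu1 - mu2) ** 2) + np.trace(cov1 + cov2 - 2 * cross))

pred = (np.array([10.0, 12.0]), np.diag([9.0, 4.0]))   # predicted ellipse as a Gaussian
gt   = (np.array([11.0, 12.5]), np.diag([8.0, 5.0]))   # ground-truth ellipse
print(gaussian_w2(*pred, *gt))
```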

33.CenterFusion: Center-based Radar and Camera Fusion for 3D Object Detection ⬇️

The perception system in autonomous vehicles is responsible for detecting and tracking the surrounding objects. This is usually done by taking advantage of several sensing modalities to increase robustness and accuracy, which makes sensor fusion a crucial part of the perception system. In this paper, we focus on the problem of radar and camera sensor fusion and propose a middle-fusion approach to exploit both radar and camera data for 3D object detection. Our approach, called CenterFusion, first uses a center point detection network to detect objects by identifying their center points on the image. It then solves the key data association problem using a novel frustum-based method to associate the radar detections to their corresponding object's center point. The associated radar detections are used to generate radar-based feature maps to complement the image features, and regress to object properties such as depth, rotation and velocity. We evaluate CenterFusion on the challenging nuScenes dataset, where it improves the overall nuScenes Detection Score (NDS) of the state-of-the-art camera-based algorithm by more than 12%. We further show that CenterFusion significantly improves the velocity estimation accuracy without using any additional temporal information. The code is available at this https URL .
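
A simplified stand-in for the radar-to-object association step: project radar points into the image and keep those that fall inside a detection's 2D box (the actual method builds a 3D frustum from the detected centre and estimated depth). The intrinsics and points below are synthetic.

```python
import numpy as np

def associate_radar(box, radar_pts, K, max_depth=60.0):
    """Return radar points (camera-frame x, y, z) whose projection lies inside the 2D box."""
    x1, y1, x2, y2 = box
    proj = radar_pts @ K.T
    u, v = proj[:, 0] / proj[:, 2], proj[:, 1] / proj[:, 2]
    keep = (u >= x1) & (u <= x2) & (v >= y1) & (v <= y2) & (radar_pts[:, 2] < max_depth)
    return radar_pts[keep]

K = np.array([[800.0, 0.0, 640.0], [0.0, 800.0, 360.0], [0.0, 0.0, 1.0]])
pts = np.random.rand(50, 3) * np.array([20.0, 5.0, 60.0]) + np.array([-10.0, -2.0, 1.0])
matched = associate_radar((600.0, 300.0, 700.0, 420.0), pts, K)
print(matched.shape)
```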

34.Kinematics-Guided Reinforcement Learning for Object-Aware 3D Ego-Pose Estimation ⬇️

We propose a method for incorporating object interaction and human body dynamics into the task of 3D ego-pose estimation using a head-mounted camera. We use a kinematics model of the human body to represent the entire range of human motion, and a dynamics model of the body to interact with objects inside a physics simulator. By bringing together object modeling, kinematics modeling, and dynamics modeling in a reinforcement learning (RL) framework, we enable object-aware 3D ego-pose estimation. We devise several representational innovations through the design of the state and action space to incorporate 3D scene context and improve pose estimation quality. We also construct a fine-tuning step to correct the drift and refine the estimated human-object interaction. This is the first work to estimate a physically valid 3D full-body interaction sequence with objects (e.g., chairs, boxes, obstacles) from egocentric videos. Experiments with both controlled and in-the-wild settings show that our method can successfully extract an object-conditioned 3D ego-pose sequence that is consistent with the laws of physics.

35.After All, Only The Last Neuron Matters: Comparing Multi-modal Fusion Functions for Scene Graph Generation ⬇️

From object segmentation to word vector representations, Scene Graph Generation (SGG) has become a complex task built upon numerous research results. In this paper, we focus on the last module of this model: the fusion function. The role of the latter is to combine three hidden states. We perform an ablation test in order to compare different implementations. First, we reproduce the state-of-the-art results using the SUM and GATE functions. Then we expand the original solution by adding more model-agnostic functions: an adapted version of DIST and a mixture between MFB and GATE. On the basis of the state-of-the-art configuration, DIST achieved the best Recall@K, which makes it now part of the state-of-the-art.
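
Plausible shapes of the two reproduced fusion functions, SUM and GATE, over three hidden states; the actual SGG implementation may use different projections and dimensions, so treat this as an assumption-laden sketch.

```python
import torch
import torch.nn as nn

def sum_fusion(h1, h2, h3):
    """Element-wise sum of the three hidden states."""
    return h1 + h2 + h3

class GateFusion(nn.Module):
    """A learned sigmoid gate decides how to mix the hidden states."""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(3 * dim, dim)

    def forward(self, h1, h2, h3):
        g = torch.sigmoid(self.gate(torch.cat([h1, h2, h3], dim=-1)))
        return g * h1 + (1 - g) * (h2 + h3)

h = [torch.randn(4, 512) for _ in range(3)]
out_sum, out_gate = sum_fusion(*h), GateFusion(512)(*h)
```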

36.Predicting Landsat Reflectance with Deep Generative Fusion ⬇️

Public satellite missions are commonly bound to a trade-off between spatial and temporal resolution as no single sensor provides fine-grained acquisitions with frequent coverage. This hinders their potential to assist vegetation monitoring or humanitarian actions, which require detecting rapid and detailed terrestrial surface changes. In this work, we probe the potential of deep generative models to produce high-resolution optical imagery by fusing products with different spatial and temporal characteristics. We introduce a dataset of co-registered Moderate Resolution Imaging Spectroradiometer (MODIS) and Landsat surface reflectance time series and demonstrate the ability of our generative model to blend coarse daily reflectance information into low-paced finer acquisitions. We benchmark our proposed model against state-of-the-art reflectance fusion algorithms.

37.MUSE: Illustrating Textual Attributes by Portrait Generation ⬇️

We propose a novel approach, MUSE, to illustrate textual attributes visually via portrait generation. MUSE takes a set of attributes written in text, in addition to facial features extracted from a photo of the subject, as input. We propose 11 attribute types to represent inspirations from a subject's profile, emotion, story, and environment. We propose a novel stacked neural network architecture by extending an image-to-image generative model to accept textual attributes. Experiments show that our approach significantly outperforms several state-of-the-art methods without using textual attributes, with the Inception Score increased by 6% and the Fréchet Inception Distance (FID) decreased by 11%, respectively. We also propose a new attribute reconstruction metric to evaluate whether the generated portraits preserve the subject's attributes. Experiments show that our approach can accurately illustrate 78% of textual attributes, which also helps MUSE capture the subject in a more creative and expressive way.

38.Learning to Infer Semantic Parameters for 3D Shape Editing ⬇️

Many applications in 3D shape design and augmentation require the ability to make specific edits to an object's semantic parameters (e.g., the pose of a person's arm or the length of an airplane's wing) while preserving as much existing details as possible. We propose to learn a deep network that infers the semantic parameters of an input shape and then allows the user to manipulate those parameters. The network is trained jointly on shapes from an auxiliary synthetic template and unlabeled realistic models, ensuring robustness to shape variability while relieving the need to label realistic exemplars. At testing time, edits within the parameter space drive deformations to be applied to the original shape, which provides semantically-meaningful manipulation while preserving the details. This is in contrast to prior methods that either use autoencoders with a limited latent-space dimensionality, failing to preserve arbitrary detail, or drive deformations with purely-geometric controls, such as cages, losing the ability to update local part regions. Experiments with datasets of chairs, airplanes, and human bodies demonstrate that our method produces more natural edits than prior work.

39.Similarity-Based Clustering for Enhancing Image Classification Architectures ⬇️

Convolutional networks are at the center of best-in-class computer vision applications for a wide assortment of tasks. Since 2014, a profound amount of work has gone into designing better convolutional architectures, yielding generous gains on different benchmarks. Although increased model size and computational cost generally translate into prompt quality gains for most tasks, architectures now need additional information to increase performance further. We show empirical evidence that with the amalgamation of content-based image similarity and deep learning models, we can provide the flow of information which can be used in making clustered learning possible. We show how parallel training of sub-dataset clusters not only reduces the cost of computation but also increases the benchmark accuracies by 5-11 percent.

40.Ontology-driven Event Type Classification in Images ⬇️

Event classification can add valuable information for semantic search and the increasingly important topic of fact validation in news. So far, only few approaches address image classification for newsworthy event types such as natural disasters, sports events, or elections. Previous work distinguishes only between a limited number of event types and relies on rather small datasets for training. In this paper, we present a novel ontology-driven approach for the classification of event types in images. We leverage a large number of real-world news events to pursue two objectives: First, we create an ontology based on Wikidata comprising the majority of event types. Second, we introduce a novel large-scale dataset that was acquired through Web crawling. Several baselines are proposed including an ontology-driven learning approach that aims to exploit structured information of a knowledge graph to learn relevant event relations using deep neural networks. Experimental results on existing as well as novel benchmark datasets demonstrate the superiority of the proposed ontology-driven approach.

41.Explainable COVID-19 Detection Using Chest CT Scans and Deep Learning ⬇️

This paper explores how well deep learning models trained on chest CT images can diagnose COVID-19 infected people in a fast and automated process. To this end, we adopt advanced deep network architectures and propose a transfer learning strategy using custom-sized input tailored for each deep architecture to achieve the best performance. We conduct extensive sets of experiments on two CT image datasets, namely the SARS-CoV-2 CT-scan and the COVID19-CT. The obtained results show superior performances for our models compared with previous studies, where our best models achieve average accuracy, precision, sensitivity, specificity and F1 score of 99.4%, 99.6%, 99.8%, 99.6% and 99.4% on the SARS-CoV-2 dataset; and 92.9%, 91.3%, 93.7%, 92.2% and 92.5% on the COVID19-CT dataset, respectively. Furthermore, we apply two visualization techniques to provide visual explanations for the models' predictions. The visualizations show well-separated clusters for CT images of COVID-19 from other lung diseases, and accurate localizations of the COVID-19 associated regions.

42.An Attack on InstaHide: Is Private Learning Possible with Instance Encoding? ⬇️

A learning algorithm is private if the produced model does not reveal (too much) about its training set. InstaHide [Huang, Song, Li, Arora, ICML'20] is a recent proposal that claims to preserve privacy by an encoding mechanism that modifies the inputs before being processed by the normal learner.
We present a reconstruction attack on InstaHide that is able to use the encoded images to recover visually recognizable versions of the original images. Our attack is effective and efficient, and empirically breaks InstaHide on CIFAR-10, CIFAR-100, and the recently released InstaHide Challenge.
We further formalize various privacy notions of learning through instance encoding and investigate the possibility of achieving these notions. We prove barriers against achieving (indistinguishability based notions of) privacy through any learning protocol that uses instance encoding.

43.EPSR: Edge Profile Super resolution ⬇️

Recently, numerous deep convolutional neural networks (CNNs) have been explored for single image super-resolution (SISR) and they have achieved significant performance. However, most deep CNN-based SR methods mainly focus on designing wider or deeper architectures, and it is hard to find methods that utilize image properties in SISR. In this paper, by developing an edge-profile approach based on an end-to-end CNN model for the SISR problem, we propose edge profile super resolution (EPSR). Specifically, we construct a residual edge enhance block (REEB), which consists of a residual efficient channel attention block (RECAB), an edge profile (EP) module, and a context network (CN) module. RECAB adaptively rescales channel-wise features by efficiently considering interdependencies among channels. From the features, the EP module generates edge-guided features by extracting the edge profile itself, and then the CN module enhances details by exploiting contextual information of the features. To utilize various information from low to high frequency components, we design a fractal skip connection (FSC) structure. Owing to the self-similarity of the architecture, the FSC structure allows our EPSR to bypass abundant information into each REEB block. Experimental results show that our EPSR achieves competitive performance against state-of-the-art methods.

44.OpenKinoAI: An Open Source Framework for Intelligent Cinematography and Editing of Live Performances ⬇️

OpenKinoAI is an open source framework for post-production of ultra high definition video which makes it possible to emulate professional multiclip editing techniques for the case of single camera recordings. OpenKinoAI includes tools for uploading raw video footage of live performances on a remote web server, detecting, tracking and recognizing the performers in the original material, reframing the raw video into a large choice of cinematographic rushes, editing the rushes into movies, and annotating rushes and movies for documentation purposes. OpenKinoAI is made available to promote research in multiclip video editing of ultra high definition video, and to allow performing artists and companies to use this research for archiving, documenting and sharing their work online in an innovative fashion.

45.Pristine annotations-based multi-modal trained artificial intelligence solution to triage chest X-ray for COVID-19 ⬇️

The COVID-19 pandemic continues to spread and impact the well-being of the global population. Front-line modalities, including computed tomography (CT) and X-ray, play an important role in triaging COVID patients. Considering the limited access to resources (both hardware and trained personnel) and decontamination requirements, CT may not be ideal for triaging suspected subjects. Artificial intelligence (AI) assisted X-ray based applications, which help experienced radiologists identify COVID patients in a timely manner and further delineate the disease region boundary, are seen as a promising solution for triaging and monitoring. Our proposed solution differs from existing solutions by industry and academic communities, and demonstrates a functional AI model that triages by inferencing on a single X-ray image, while the deep-learning model is trained using both X-ray and CT data. We report on how such multi-modal training improves the solution compared to X-ray only training. The multi-modal solution increases the AUC (area under the receiver operating characteristic curve) from 0.89 to 0.93 and also improves the Dice coefficient (0.59 to 0.62) for localizing the pathology. To the best of our knowledge, it is the first X-ray solution developed by leveraging multi-modal information.

46.Bridging the Performance Gap between FGSM and PGD Adversarial Training ⬇️

Deep learning achieves state-of-the-art performance in many tasks but remains vulnerable to adversarial examples. Among existing defense techniques, adversarial training with the projected gradient descent attack (adv.PGD) is considered one of the most effective ways to achieve moderate adversarial robustness. However, adv.PGD requires much training time since the projected gradient descent (PGD) attack takes multiple iterations to generate perturbations. On the other hand, adversarial training with the fast gradient sign method (adv.FGSM) takes much less training time since the fast gradient sign method (FGSM) takes a single step to generate perturbations, but it fails to increase adversarial robustness. In this work, we extend adv.FGSM to achieve the adversarial robustness of adv.PGD. We demonstrate that the large curvature along the FGSM-perturbed direction leads to the large difference in adversarial robustness between adv.FGSM and adv.PGD, and therefore propose combining adv.FGSM with a curvature regularization (adv.FGSMR) in order to bridge the performance gap between adv.FGSM and adv.PGD. The experiments show that adv.FGSMR has higher training efficiency than adv.PGD. In addition, it achieves comparable adversarial robustness on the MNIST dataset under white-box attack, and on the CIFAR-10 dataset it achieves better performance than adv.PGD under white-box attack and more effectively defends against transferable adversarial attacks.
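
The sketch below illustrates one plausible reading of adv.FGSMR: a single-step FGSM adversarial loss plus a penalty on how much the input gradient changes along the FGSM direction (a finite-difference curvature proxy). The exact regularizer and hyperparameters in the paper may differ.

```python
# Hedged sketch of FGSM adversarial training with a curvature-style penalty along the
# FGSM direction (a finite-difference proxy; the paper's exact regularizer may differ).
import torch
import torch.nn.functional as F

def fgsmr_step(model, x, y, optimizer, eps=8 / 255, lam=1.0):
    x = x.detach().clone().requires_grad_(True)
    grad = torch.autograd.grad(F.cross_entropy(model(x), y), x, create_graph=True)[0]

    x_adv = (x + eps * grad.sign()).clamp(0, 1)          # one-step FGSM example
    loss_adv = F.cross_entropy(model(x_adv), y)

    grad_adv = torch.autograd.grad(loss_adv, x_adv, create_graph=True)[0]
    curvature = (grad_adv - grad).flatten(1).norm(dim=1).mean()  # gradient change

    loss = loss_adv + lam * curvature                     # adversarial loss + penalty
    optimizer.zero_grad()
    loss.backward()                                       # double backprop through grads
    optimizer.step()
    return loss.item()
```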

47.Principles of Stochastic Computing: Fundamental Concepts and Applications ⬇️

The semiconductor and IC industry is facing the issue of high energy consumption. Modern computers and processing systems are designed based on the Turing machine and Von Neumann's architecture, which mainly focuses on deterministic behavior. To tackle energy consumption and reliability, stochastic computing was introduced. In this research, we review and study the principles behind stochastic computing and its implementation techniques. By utilizing stochastic computing, we can achieve higher energy efficiency and smaller area in the design of arithmetic units. We also aim to promote the adoption of stochastic systems in the design of futuristic BLSI and neuromorphic systems.
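
A tiny example of the core principle: numbers in [0, 1] are encoded as random bitstreams whose density of ones equals the value, and multiplication then costs a single AND gate per bit. This is a generic textbook illustration, not code from the paper.

```python
# Toy stochastic-computing illustration: unipolar bitstream encoding and AND-gate multiply.
import numpy as np

rng = np.random.default_rng(0)
N = 10_000                                    # bitstream length (longer -> lower variance)

def to_stream(p):                             # unipolar encoding: P(bit = 1) = p
    return rng.random(N) < p

a, b = 0.6, 0.7
product_stream = to_stream(a) & to_stream(b)  # AND of independent streams multiplies values
print(product_stream.mean())                  # ~0.42 = 0.6 * 0.7, up to sampling noise
```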

48.Classification of optics-free images with deep neural networks ⬇️

The thinnest possible camera is achieved by removing all optics, leaving only the image sensor. We train deep neural networks to perform multi-class detection and binary classification (with accuracy of 92%) on optics-free images without the need for anthropocentric image reconstructions. Inferencing from optics-free images has the potential for enhanced privacy and power efficiency.

49.A Soft Computing Approach for Selecting and Combining Spectral Bands ⬇️

We introduce a soft computing approach for automatically selecting and combining indices from remote sensing multispectral images that can be used for classification tasks. The proposed approach is based on a Genetic Programming (GP) framework, a technique successfully used in a wide variety of optimization problems. Through GP, it is possible to learn indices that maximize the separability of samples from two different classes. Once indices specialized for all pairs of classes are obtained, they are used in pixelwise classification tasks. We used the GP-based solution to evaluate complex classification problems, such as those related to the discrimination of vegetation types within and between tropical biomes. Using time series defined in terms of the learned spectral indices, we show that the GP framework leads to superior results compared with other indices used to discriminate and classify tropical biomes.
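
The sketch below shows the flavor of the fitness evaluation: a candidate index (here an NDVI-like normalized difference over two hypothetical bands) is scored by a Fisher-style separability ratio between two classes, and a GP search would evolve such band-combination expressions to maximize this score. The band assignments and data are illustrative assumptions.

```python
# Hedged sketch: score a candidate spectral index by how well it separates two classes
# (synthetic data and band roles are assumptions; GP would evolve the index expression).
import numpy as np

rng = np.random.default_rng(0)
class_a = rng.normal([0.10, 0.40, 0.30, 0.60], 0.05, size=(500, 4))  # (pixels, bands)
class_b = rng.normal([0.20, 0.30, 0.30, 0.40], 0.05, size=(500, 4))

def candidate_index(bands):                   # one individual in the GP population
    nir, red = bands[:, 3], bands[:, 2]
    return (nir - red) / (nir + red + 1e-8)   # NDVI-like expression

def fisher_fitness(idx_a, idx_b):             # between-class vs. within-class spread
    between = (idx_a.mean() - idx_b.mean()) ** 2
    within = idx_a.var() + idx_b.var()
    return between / (within + 1e-8)

print(fisher_fitness(candidate_index(class_a), candidate_index(class_b)))
```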

50.Noise2Stack: Improving Image Restoration by Learning from Volumetric Data ⬇️

Biomedical images are noisy. The imaging equipment itself has physical limitations, and the consequent experimental trade-offs between signal-to-noise ratio, acquisition speed, and imaging depth exacerbate the problem. Denoising is, therefore, an essential part of any image processing pipeline, and convolutional neural networks are currently the method of choice for this task. One popular approach, Noise2Noise, does not require clean ground truth and instead uses a second noisy copy as a training target. Self-supervised methods, like Noise2Self and Noise2Void, relax data requirements by learning the signal without an explicit target but are limited by the lack of information in a single image. Here, we introduce Noise2Stack, an extension of the Noise2Noise method to image stacks that takes advantage of the signal shared between spatially neighboring planes. Our experiments on magnetic resonance brain scans and newly acquired multiplane microscopy data show that learning only from image neighbors in a stack is sufficient to outperform Noise2Noise and Noise2Void and to close the gap to supervised denoising methods. Our findings point towards a low-cost, high-reward improvement in the denoising pipeline of multiplane biomedical images. As part of this work, we release a microscopy dataset to establish a benchmark for multiplane image denoising.
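
One plausible reading of the training scheme, sketched below: a network predicts a (noisy) plane from its two neighbors in the stack, so with noise independent across planes only the shared signal can be learned. The toy architecture and data are assumptions, not the authors' exact setup.

```python
# Hedged Noise2Stack-style training sketch: predict a plane from its stack neighbors.
import torch
import torch.nn as nn

net = nn.Sequential(                         # toy denoiser: 2 neighbor planes -> 1 plane
    nn.Conv2d(2, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 1, 3, padding=1),
)
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

stack = torch.rand(1, 9, 64, 64)             # stand-in noisy volume: (batch, planes, H, W)
for step in range(10):
    z = torch.randint(1, 8, (1,)).item()     # pick an interior plane
    neighbors = stack[:, [z - 1, z + 1]]     # input: planes above and below
    target = stack[:, z:z + 1]               # target: the (noisy) plane itself
    loss = nn.functional.mse_loss(net(neighbors), target)
    opt.zero_grad(); loss.backward(); opt.step()
```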

51.Deep correction of breathing-related artifacts in MR-thermometry ⬇️

Real-time MR-imaging has been clinically adapted for monitoring thermal therapies since it can provide on-the-fly temperature maps simultaneously with anatomical information. However, proton resonance frequency based thermometry of moving targets remains challenging since temperature artifacts are induced by respiratory as well as physiological motion. If left uncorrected, these artifacts lead to severe errors in temperature estimates and impair therapy guidance. In this study, we evaluated deep learning for on-line correction of motion-related errors in abdominal MR-thermometry. For this, a convolutional neural network (CNN) was designed to learn the apparent temperature perturbation from images acquired during a preparative learning stage prior to hyperthermia. The input of the designed CNN is the most recent magnitude image, and no surrogate of motion is needed. During the subsequent hyperthermia procedure, the most recent magnitude image is used as input to the CNN model in order to generate an on-line correction for the current temperature map. The method's artifact suppression performance was evaluated on 12 free-breathing volunteers and was found to be robust and artifact-free in all examined cases. Furthermore, thermometric precision and accuracy were assessed for in vivo ablation using high intensity focused ultrasound. All calculations involved in the different stages of the proposed workflow were designed to be compatible with the clinical time constraints of a therapeutic procedure.
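
A minimal sketch of the correction idea, under the assumption that a small CNN maps the most recent magnitude image to the motion-induced apparent temperature perturbation, which is then subtracted from the PRF-based temperature map; the network and image sizes below are placeholders.

```python
# Hedged sketch: predict the apparent temperature perturbation from the magnitude image
# and subtract it from the uncorrected temperature map (toy network and sizes).
import torch
import torch.nn as nn

correction_net = nn.Sequential(
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 1, 3, padding=1),              # predicted perturbation map
)

magnitude = torch.rand(1, 1, 128, 128)           # most recent magnitude image
raw_temperature = torch.rand(1, 1, 128, 128)     # uncorrected PRF-based temperature map
corrected = raw_temperature - correction_net(magnitude)
print(corrected.shape)                           # torch.Size([1, 1, 128, 128])
```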

52.Tattoo tomography: Freehand 3D photoacoustic image reconstruction with an optical pattern ⬇️

Purpose: Photoacoustic tomography (PAT) is a novel imaging technique that can spatially resolve both morphological and functional tissue properties, such as the vessel topology and tissue oxygenation. While this capacity makes PAT a promising modality for the diagnosis, treatment and follow-up of various diseases, a current drawback is the limited field-of-view (FoV) provided by the conventionally applied 2D probes.
Methods: In this paper, we present a novel approach to 3D reconstruction of PAT data (Tattoo tomography) that does not require an external tracking system and can smoothly be integrated into clinical workflows. It is based on an optical pattern placed on the region of interest prior to image acquisition. This pattern is designed in a way that a tomographic image of it enables the recovery of the probe pose relative to the coordinate system of the pattern. This allows the transformation of a sequence of acquired PA images into one common global coordinate system and thus the consistent 3D reconstruction of PAT imaging data.
Results: An initial feasibility study conducted with experimental phantom data and in vivo forearm data indicates that the Tattoo approach is well-suited for 3D reconstruction of PAT data with high accuracy and precision.
Conclusion: In contrast to previous approaches to 3D ultrasound (US) or PAT reconstruction, the Tattoo approach neither requires complex external hardware nor training data acquired for a specific application. It could thus become a valuable tool for clinical freehand PAT.
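
The geometric core of such a pattern-based reconstruction can be sketched as rigid point-set registration: given pattern landmarks detected in the probe/image frame and their known coordinates in the pattern frame, a Kabsch fit recovers the pose, and stacking per-frame poses places every slice in one global coordinate system. Landmark detection and the actual pattern design are not shown; the points below are synthetic.

```python
# Hedged sketch: recover a rigid pose from corresponding landmarks (Kabsch algorithm).
import numpy as np

def kabsch(src, dst):
    """Rigid transform (R, t) with R @ src_i + t ~= dst_i for point sets of shape (N, 3)."""
    src_c, dst_c = src - src.mean(0), dst - dst.mean(0)
    U, _, Vt = np.linalg.svd(src_c.T @ dst_c)
    d = np.sign(np.linalg.det(Vt.T @ U.T))                    # avoid reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst.mean(0) - R @ src.mean(0)
    return R, t

# synthetic example: known pattern landmarks and their detected positions in the image frame
pattern_pts = np.array([[0, 0, 0], [10, 0, 0], [0, 10, 0], [10, 10, 0]], float)
rot = np.array([[0, -1, 0], [1, 0, 0], [0, 0, 1]], float)     # 90-degree rotation
image_pts = pattern_pts @ rot.T + np.array([5.0, 2.0, 1.0])   # "detected" landmarks

R, t = kabsch(image_pts, pattern_pts)     # maps image-frame points into the pattern frame
print(np.allclose(image_pts @ R.T + t, pattern_pts))          # True
```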

53.AIM 2020 Challenge on Rendering Realistic Bokeh ⬇️

This paper reviews the second AIM realistic bokeh effect rendering challenge and provides a description of the proposed solutions and results. The participating teams were solving a real-world bokeh simulation problem, where the goal was to learn a realistic shallow focus technique using a large-scale EBB! bokeh dataset consisting of 5K shallow / wide depth-of-field image pairs captured using a Canon 7D DSLR camera. The participants had to render the bokeh effect based on a single frame without any additional data from other cameras or sensors. The target metric used in this challenge combined the runtime and the perceptual quality of the solutions measured in a user study. To ensure the efficiency of the submitted models, we measured their runtime on standard desktop CPUs as well as on smartphone GPUs. The proposed solutions significantly improved the baseline results, defining the state of the art for the practical bokeh effect rendering problem.

54.Towards a Better Global Loss Landscape of GANs ⬇️

Understanding of GAN training is still very limited. One major challenge is its non-convex-non-concave min-max objective, which may lead to sub-optimal local minima. In this work, we perform a global landscape analysis of the empirical loss of GANs. We prove that a class of separable GANs, including the original JS-GAN, has exponentially many bad basins, which are perceived as mode collapse. We also study the relativistic pairing GAN (RpGAN) loss, which couples the generated samples and the true samples, and prove that RpGAN has no bad basins. Experiments on synthetic data show that the predicted bad basins can indeed appear in training. We also perform experiments supporting our theory that RpGAN has a better landscape than separable GANs; for instance, we empirically show that RpGAN performs better than a separable GAN with relatively narrow neural nets. The code is available at this https URL.
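
For reference, a commonly used form of the relativistic pairing loss is sketched below: the critic score of each real sample is compared directly against that of a paired generated sample, coupling the two distributions. The pairing scheme and exact formulation used in the paper may differ.

```python
# Hedged sketch of an RpGAN-style paired loss (softplus form assumed for illustration).
import torch
import torch.nn as nn
import torch.nn.functional as F

def rpgan_d_loss(critic, real, fake):
    return F.softplus(-(critic(real) - critic(fake))).mean()

def rpgan_g_loss(critic, real, fake):
    return F.softplus(-(critic(fake) - critic(real))).mean()

critic = nn.Sequential(nn.Flatten(), nn.Linear(2, 16), nn.ReLU(), nn.Linear(16, 1))
real, fake = torch.randn(8, 2), torch.randn(8, 2)
print(rpgan_d_loss(critic, real, fake).item(), rpgan_g_loss(critic, real, fake).item())
```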

55.The Virtual Goniometer: A new method for measuring angles on 3D models of fragmentary bone and lithics ⬇️

The contact goniometer is a commonly used tool in lithic and zooarchaeological analysis, despite suffering from a number of shortcomings due to the physical interaction between the measuring implement, the object being measured, and the individual taking the measurements. However, lacking a simple and efficient alternative, researchers in a variety of fields continue to use the contact goniometer to this day. In this paper, we present a new goniometric method that we call the virtual goniometer, which takes angle measurements virtually on a 3D model of an object. The virtual goniometer allows for rapid data collection, and for the measurement of many angles that cannot be physically accessed by a manual goniometer. We compare the intra-observer variability of the manual and virtual goniometers, and find that the virtual goniometer is far more consistent and reliable. Furthermore, the virtual goniometer allows for precise replication of angle measurements, even among multiple users, which is important for reproducibility of goniometric-based research. The virtual goniometer is available as a plug-in in the open source mesh processing packages Meshlab and Blender, making it easily accessible to researchers exploring the potential for goniometry to improve archaeological methods and address anthropological questions.
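
The measurement itself reduces to simple geometry, sketched below under the assumption that an angle is taken between planes fitted (via PCA) to two vertex patches selected on either side of an edge; how the plug-in segments patches on the mesh is not reproduced here.

```python
# Hedged sketch: angle between two surface patches via PCA plane fitting.
import numpy as np

def patch_normal(points):                        # points: (N, 3) vertices of one patch
    centered = points - points.mean(0)
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    return Vt[-1]                                # direction of least variance = plane normal

def patch_angle_deg(patch_a, patch_b):
    n1, n2 = patch_normal(patch_a), patch_normal(patch_b)
    cos = np.clip(abs(n1 @ n2), 0.0, 1.0)        # normals have arbitrary sign
    return np.degrees(np.arccos(cos))

# toy patches: two planes meeting at ~90 degrees
rng = np.random.default_rng(0)
a = np.c_[rng.random(100), rng.random(100), np.zeros(100)]
b = np.c_[np.zeros(100), rng.random(100), rng.random(100)]
print(patch_angle_deg(a, b))                     # ~90.0
```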

56.Learnings from Frontier Development Lab and SpaceML -- AI Accelerators for NASA and ESA ⬇️

Research with AI and ML technologies lives in a variety of settings with often asynchronous goals and timelines: academic labs and government organizations pursue open-ended research focusing on discoveries with long-term value, while research in industry is driven by commercial pursuits and hence focuses on short-term timelines and return on investment. The journey from research to product is often tacit or ad hoc, resulting in technology transition failures, further exacerbated when research and development is interorganizational and interdisciplinary. Moreover, much of the ability to produce results remains locked in the private repositories and know-how of the individual researcher, slowing the impact on future research by others and contributing to the ML community's challenges in reproducibility. With research organizations focused on an exploding array of fields, opportunities for the handover and maturation of interdisciplinary research diminish. Given these tensions, we see an emerging need to measure the correctness, impact, and relevance of research during its development to enable better collaboration, improved reproducibility, faster progress, and more trusted outcomes. We perform a case study of the Frontier Development Lab (FDL), an AI accelerator under a public-private partnership between NASA and ESA. FDL research follows principled practices that are grounded in responsible development, conduct, and dissemination of AI research, enabling FDL to churn out successful interdisciplinary and interorganizational research projects, measured through NASA's Technology Readiness Levels. We also take a look at the SpaceML Open Source Research Program, which helps accelerate and transition FDL's research to deployable projects with widespread adoption among citizen scientists.

57.Deep Reinforcement Learning for Navigation in AAA Video Games ⬇️

In video games, non-player characters (NPCs) are used to enhance the players' experience in a variety of ways, e.g., as enemies, allies, or innocent bystanders. A crucial component of NPCs is navigation, which allows them to move from one point to another on the map. The most popular approach for NPC navigation in the video game industry is to use a navigation mesh (NavMesh), a graph representation of the map with nodes and edges indicating traversable areas. Unfortunately, complex navigation abilities that extend the character's capacity for movement, e.g., grappling hooks, jetpacks, teleportation, or double-jumps, increase the complexity of the NavMesh, making it intractable in many practical scenarios. Game designers are thus constrained to only add abilities that can be handled by a NavMesh if they want to have NPC navigation. As an alternative, we propose to use Deep Reinforcement Learning (Deep RL) to learn how to navigate 3D maps using any navigation ability. We test our approach on complex 3D environments in the Unity game engine that are notably an order of magnitude larger than maps typically used in the Deep RL literature. One of these maps is directly modeled after a Ubisoft AAA game. We find that our approach performs surprisingly well, achieving at least 90% success rate on all tested scenarios. A video of our results is available at this https URL.
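
As a toy illustration of what a learned navigation policy consumes and optimizes (not the paper's setup), the sketch below shows a goal-relative observation and a shaped reward that credits progress toward the goal; any movement ability is then exploited implicitly through experience.

```python
# Toy point-goal navigation ingredients: goal-relative observation and progress reward
# (illustrative assumptions only; the paper's observation/reward design is not shown).
import numpy as np

def observation(agent_pos, goal_pos, agent_heading):
    rel = goal_pos - agent_pos
    dist = np.linalg.norm(rel)
    bearing = np.arctan2(rel[1], rel[0]) - agent_heading   # goal direction in agent frame
    return np.array([dist, np.cos(bearing), np.sin(bearing)])

def shaped_reward(prev_pos, new_pos, goal_pos, reach_radius=0.5, reach_bonus=10.0):
    progress = np.linalg.norm(goal_pos - prev_pos) - np.linalg.norm(goal_pos - new_pos)
    done = np.linalg.norm(goal_pos - new_pos) < reach_radius
    return progress + (reach_bonus if done else 0.0), done

obs = observation(np.zeros(3), np.array([5.0, 0.0, 0.0]), agent_heading=0.0)
r, done = shaped_reward(np.zeros(3), np.array([1.0, 0.0, 0.0]), np.array([5.0, 0.0, 0.0]))
print(obs, r, done)
```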

58.Predicting the Future is like Completing a Painting! ⬇️

This article is an introductory work towards a larger research framework related to Scientific Prediction. It is a mix of science and philosophy of science, so we can speak of Experimental Philosophy of Science. As a first result, we introduce a new forecasting method based on image completion, named the Forecasting Method by Image Inpainting (FM2I). In effect, time series forecasting is transformed into fully image- and signal-based processing procedures. After transforming a time series into its corresponding image, the data forecasting problem essentially becomes an image inpainting problem, i.e., completing missing data in the image. An extensive experimental evaluation is conducted using the large dataset proposed by the well-known M3-competition. Results show that FM2I is an efficient and robust tool for time series forecasting: it achieves prominent results in terms of accuracy and outperforms the best M3 forecasting methods.
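
The transformation at the heart of FM2I can be sketched as follows: the observed series is rasterized into a binary image (columns = time, rows = quantized value) and the columns corresponding to the forecast horizon are masked; an image-inpainting model (not included) then completes the masked region, and the forecast is read back from the completed columns. The rasterization scheme below is an illustrative assumption.

```python
# Hedged sketch: rasterize a time series into an image and mask the forecast horizon.
import numpy as np

series = np.sin(np.linspace(0, 6 * np.pi, 120)) + 0.1 * np.random.randn(120)
height, horizon = 64, 18                            # image height and forecast horizon

levels = np.interp(series, (series.min(), series.max()), (0, height - 1)).astype(int)
image = np.zeros((height, len(series) + horizon), dtype=np.uint8)
image[levels, np.arange(len(series))] = 1           # rasterize the observed history

mask = np.zeros_like(image)
mask[:, len(series):] = 1                           # columns an inpainting model must fill
# forecast = value decoded from the active pixel of each completed column
```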