ArXiv cs.CV --Fri, 3 Jul 2020

1.Image Classification by Reinforcement Learning with Two-State Q-Learning

In this paper, a simple and efficient Hybrid Classifier is presented which is based on deep learning and reinforcement learning. Q-Learning has been used with two states and 'two or three' actions. Other techniques found in the literature use feature maps extracted from Convolutional Neural Networks and use these in the Q-states along with past history. This leads to technical difficulties in these approaches because the number of states is high due to the large dimensions of the feature maps. Because our technique uses only two Q-states, it is straightforward and consequently has far fewer optimization parameters, and thus also has a simple reward function. Also, the proposed technique uses novel actions for processing images as compared to other techniques found in the literature. The performance of the proposed technique is compared with other recent algorithms like ResNet50, InceptionV3, etc. on popular databases including ImageNet, Cats and Dogs Dataset, and Caltech-101 Dataset. Our approach outperforms other techniques on all the datasets used.
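
The abstract does not say what the two Q-states, the actions, or the reward are, so the following is only a minimal sketch of the standard tabular Q-learning update on a table of that size. The learning rate, discount factor, and the example transition are assumptions made for illustration, not values from the paper.

```python
import numpy as np

N_STATES, N_ACTIONS = 2, 3  # two Q-states; the abstract says "two or three" actions

def q_update(q, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """Apply one standard Q-learning update in place and return the table."""
    td_target = reward + gamma * np.max(q[next_state])
    q[state, action] += alpha * (td_target - q[state, action])
    return q

q_table = np.zeros((N_STATES, N_ACTIONS))
# Hypothetical transition: in state 0, action 2 earns reward 1 and leads to state 1.
q_update(q_table, state=0, action=2, reward=1.0, next_state=1)
```

With only two states the whole table has six entries, which is what keeps the number of quantities to optimize small compared with approaches that place CNN feature maps in the state.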

2.A Closer Look at Local Aggregation Operators in Point Cloud Analysis

Recent advances in network architectures for point cloud processing are mainly driven by new designs of local aggregation operators. However, the impact of these operators on network performance has not been carefully investigated, because the overall network architecture and implementation details differ in each solution. Meanwhile, most operators are only applied in shallow architectures. In this paper, we revisit the representative local aggregation operators and study their performance using the same deep residual architecture. Our investigation reveals that despite the different designs of these operators, all of them make surprisingly similar contributions to the network performance under the same network input and feature numbers and result in state-of-the-art accuracy on standard benchmarks. This finding stimulates us to rethink the necessity of sophisticated design of local aggregation operators for point cloud processing. To this end, we propose a simple local aggregation operator without learnable weights, named Position Pooling (PosPool), which performs similarly or slightly better than existing sophisticated operators. In particular, a simple deep residual network with PosPool layers achieves outstanding performance on all benchmarks and outperforms the previous state-of-the-art methods on the challenging PartNet datasets by a large margin (7.4 mIoU). The code is publicly available at this https URL

3.Deep Single Image Manipulation

Image manipulation has attracted much research over the years due to the popularity and commercial importance of the task. In recent years, deep neural network methods have been proposed for many image manipulation tasks. A major issue with deep methods is the need to train on large amounts of data from the same distribution as the target image, whereas collecting datasets encompassing the entire long tail of images is impossible. In this paper, we demonstrate that simply training a conditional adversarial generator on the single target image is sufficient for performing complex image manipulations. We find that the key to enabling single image training is extensive augmentation of the input image, and we provide a novel augmentation method. Our network learns to map from a primitive representation of the image (e.g. edges) to the image itself. At manipulation time, our generator allows for making general image changes by modifying the primitive input representation and mapping it through the network. We extensively evaluate our method and find that it provides remarkable performance.

4.RELATE: Physically Plausible Multi-Object Scene Synthesis Using Structured Latent Spaces

We present RELATE, a model that learns to generate physically plausible scenes and videos of multiple interacting objects. Similar to other generative approaches, RELATE is trained end-to-end on raw, unlabeled data. RELATE combines an object-centric GAN formulation with a model that explicitly accounts for correlations between individual objects. This allows the model to generate realistic scenes and videos from a physically-interpretable parameterization. Furthermore, we show that modeling the object correlation is necessary to learn to disentangle object positions and identity. We find that RELATE is also amenable to physically realistic scene editing and that it significantly outperforms prior art in object-centric scene generation in both synthetic (CLEVR, ShapeStacks) and real-world data (street traffic scenes). In addition, in contrast to state-of-the-art methods in object-centric generative modeling, RELATE also extends naturally to dynamic scenes and generates videos of high visual fidelity.

5.Curriculum Manager for Source Selection in Multi-Source Domain Adaptation

The performance of Multi-Source Unsupervised Domain Adaptation depends significantly on the effectiveness of transfer from labeled source domain samples. In this paper, we propose an adversarial agent that learns a dynamic curriculum for source samples, called Curriculum Manager for Source Selection (CMSS). The Curriculum Manager, an independent network module, constantly updates the curriculum during training, and iteratively learns which domains or samples are best suited for aligning to the target. The intuition behind this is to force the Curriculum Manager to constantly re-measure the transferability of latent domains over time to adversarially raise the error rate of the domain discriminator. CMSS does not require any knowledge of the domain labels, yet it outperforms other methods on four well-known benchmarks by significant margins. We also provide interpretable results that shed light on the proposed method.

6.Globally Optimal Segmentation of Mutually Interacting Surfaces using Deep Learning

Segmentation of multiple surfaces in medical images is a challenging problem, further complicated by the frequent presence of weak boundaries and mutual influence between adjacent objects. The traditional graph-based optimal surface segmentation method has proven its effectiveness with its ability to capture various surface priors in a uniform graph model. However, its efficacy heavily relies on handcrafted features that are used to define the surface cost for the "goodness" of a surface. Recently, deep learning (DL) has been emerging as a powerful tool for medical image segmentation thanks to its superior feature learning capability. Unfortunately, due to the scarcity of training data in medical imaging, it is nontrivial for DL networks to implicitly learn the global structure of the target surfaces, including surface interactions. In this work, we propose to parameterize the surface cost functions in the graph model and leverage DL to learn those parameters. The multiple optimal surfaces are then simultaneously detected by minimizing the total surface cost while explicitly enforcing the mutual surface interaction constraints. The optimization problem is solved by the primal-dual Interior Point Method, which can be implemented by a layer of neural networks, enabling efficient end-to-end training of the whole network. Experiments on Spectral Domain Optical Coherence Tomography (SD-OCT) retinal layer segmentation and Intravascular Ultrasound (IVUS) vessel wall segmentation demonstrated very promising results. All source code is public to facilitate further research in this direction.

7.Learning ordered pooling weights in image classification

Spatial pooling is an important step in computer vision systems like Convolutional Neural Networks or the Bag-of-Words method. The purpose of spatial pooling is to combine neighbouring descriptors to obtain a single descriptor for a given region (local or global). The resultant combined vector must be as discriminant as possible; in other words, it must contain relevant information while removing irrelevant and confusing details. Maximum and average are the most common aggregation functions used in the pooling step. To improve the aggregation of relevant information without degrading their discriminative power for image classification, we introduce a simple but effective scheme based on Ordered Weighted Average (OWA) aggregation operators. We present a method to learn the weights of the OWA aggregation operator in a Bag-of-Words framework and in Convolutional Neural Networks, and provide an extensive evaluation showing that OWA-based pooling outperforms classical aggregation operators.
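
As a rough illustration of why OWA generalizes the two classical choices, the sketch below applies an ordered weighted average to a small region of descriptors. The weights here are fixed by hand, whereas the paper learns them; the function name and the toy data are assumptions made for this example.

```python
import numpy as np

def owa_pool(descriptors, weights):
    """Pool an (n, d) array of descriptors into one d-vector with OWA weights.

    Each feature dimension is sorted in descending order, then combined with
    the same n weights (assumed non-negative and summing to 1).
    """
    descriptors = np.asarray(descriptors, dtype=float)
    weights = np.asarray(weights, dtype=float)
    ordered = -np.sort(-descriptors, axis=0)  # descending along the region axis
    return weights @ ordered

region = np.array([[1.0, 4.0], [3.0, 2.0], [2.0, 6.0]])
max_pooled = owa_pool(region, [1.0, 0.0, 0.0])        # all weight on the largest value
avg_pooled = owa_pool(region, [1 / 3, 1 / 3, 1 / 3])  # uniform weights
```

Putting all the weight on the first ordered position reproduces max pooling, and uniform weights reproduce average pooling, so any other weight vector sits between the two.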

8.Reinforcement Learning Based Handwritten Digit Recognition with Two-State Q-Learning

We present a simple yet efficient Hybrid Classifier based on Deep Learning and Reinforcement Learning. Q-Learning is used with two Q-states and four actions. Conventional techniques use feature maps extracted from Convolutional Neural Networks (CNNs) and include them in the Q-states along with past history. This leads to difficulties with these approaches, as the number of states is very large due to the high dimensions of the feature maps. Since our method uses only two Q-states, it is simple, has far fewer parameters to optimize, and thus also has a straightforward reward function. Also, the approach uses unexplored actions for image processing vis-a-vis other contemporary techniques. Three datasets have been used for benchmarking the approach. These are the MNIST Digit Image Dataset, the USPS Digit Image Dataset and the MATLAB Digit Image Dataset. The performance of the proposed hybrid classifier has been compared with other contemporary techniques like a well-established Reinforcement Learning Technique, AlexNet, CNN-Nearest Neighbor Classifier and CNN-Support Vector Machine Classifier. Our approach outperforms these contemporary hybrid classifiers on all the three datasets used.

9.Multiclass Classification with an Ensemble of Binary Classification Deep Networks

Deep neural network classifiers have been used frequently and are efficient. In multiclass deep network classifiers, the burden of classifying samples of different classes is put on a single classifier. As shown in this paper, the classification capability of deep networks can be further increased by using an ensemble of binary classification deep networks. In the proposed approach, a single (one-versus-all) deep network binary classifier is dedicated to each category classification. Subsequently, binary classification deep network ensembles have been investigated. Every network in an ensemble has been trained by a one-versus-all binary training technique using the Stochastic Gradient Descent with Momentum Algorithm. For classification of the test sample, the sample is presented to each network in the ensemble. After softmax-layer score voting, the network with the largest score is assumed to have classified the sample. Digit image recognition has been used for experimentation. Three datasets have been used for experimentation, viz. the MATLAB Digit Image Dataset, the USPS+ Digit Image Dataset, and the MNIST Digit Image Dataset. The experiments demonstrate that given sufficient training, a Binary Classification Convolutional Neural Network (BCCNN) ensemble can outperform a conventional Multi-class Convolutional Neural Network (MCNN). In one of the experiments, it was noted that after training and testing of a BCCNN ensemble and an MCNN respectively on a subset of the MNIST Digit Image Dataset, the BCCNN ensemble gave a higher accuracy of 98.03% as compared to the MCNN, which gave an accuracy of 97.90%. The architecture of the BCCNNs in an ensemble has also been modified in order to increase their recognition accuracy. On a large subset of the MNIST Digit Image Dataset, the modified BCCNN ensemble gave a higher accuracy of 98.50%, whereas the MCNN gave an accuracy of 98.4875%.
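
The voting step described above can be sketched as follows, assuming each binary network exposes a softmax score for its own "positive" class. The score matrix is made up for illustration, and the function name is hypothetical.

```python
import numpy as np

def ensemble_predict(positive_scores):
    """Return the predicted class for each sample.

    positive_scores has shape (n_samples, n_classes); column k holds the
    softmax score that binary network k assigns to "this sample is class k".
    """
    positive_scores = np.asarray(positive_scores, dtype=float)
    return np.argmax(positive_scores, axis=1)

# Hypothetical scores from three one-versus-all networks on two test samples.
scores = np.array([[0.10, 0.85, 0.30],
                   [0.70, 0.20, 0.65]])
predictions = ensemble_predict(scores)
```

Because each network is trained independently, the scores in a row need not sum to 1; only their relative order matters for the vote.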

10.Weakly Supervised Segmentation with Multi-scale Adversarial Attention Gates

Large, fine-grained image segmentation datasets, annotated at pixel-level, are difficult to obtain, particularly in medical imaging, where annotations also require expert knowledge. Weakly-supervised learning can train models by relying on weaker forms of annotation, such as scribbles. Here, we learn to segment using scribble annotations in an adversarial game. With unpaired segmentation masks, we train a multi-scale GAN to generate realistic segmentation masks at multiple resolutions, while we use scribbles to learn the correct position in the image. Central to the model's success is a novel attention gating mechanism, which we condition with adversarial signals to act as a shape prior, resulting in better object localization at multiple scales. We evaluated our model on several medical (ACDC, LVSC, CHAOS) and non-medical (PPSS) datasets, and we report performance levels matching those achieved by models trained with fully annotated segmentation masks. We also demonstrate extensions in a variety of settings: semi-supervised learning; combining multiple scribble sources (a crowdsourcing scenario) and multi-task learning (combining scribble and mask supervision). We will release expert-made scribble annotations for the ACDC dataset, and the code used for the experiments, at this https URL.

11.JUMPS: Joints Upsampling Method for Pose Sequences

Human Pose Estimation is a low-level task useful for surveillance, human action recognition, and scene understanding at large. It also offers promising perspectives for the animation of synthetic characters. For all these applications, and especially the latter, estimating the positions of many joints is desirable for improved performance and realism. To this end, we propose a novel method called JUMPS for increasing the number of joints in 2D pose estimates and recovering occluded or missing joints. We believe this is the first attempt to address the issue. We build on a deep generative model that combines a GAN and an encoder. The GAN learns the distribution of high-resolution human pose sequences, while the encoder maps the input low-resolution sequences to its latent space. Inpainting is obtained by computing the latent representation whose decoding by the GAN generator optimally matches the joint locations at the input. Post-processing a 2D pose sequence using our method provides a richer representation of the character motion. We show experimentally that the localization accuracy of the additional joints is on average on par with the original pose estimates.

12.Automatic Page Segmentation Without Decompressing the Run-Length Compressed Text Documents

Page segmentation is considered to be the crucial stage for the automatic analysis of documents with complex layouts. This has traditionally been carried out in uncompressed documents, although most of the documents in real life exist in a compressed form warranted by the requirement to make storage and transfer efficient. However, carrying out page segmentation directly in compressed documents without going through the stage of decompression is a challenging goal. This research paper demonstrates the possibility of carrying out a page segmentation operation directly in the run-length data of the CCITT Group-3 compressed text document, which could be single- or multi-columned and might even have some text regions in the inverted text color mode. Therefore, before carrying out the segmentation of the text document into columns, each column into paragraphs, each paragraph into text lines, each line into words, and, finally, each word into characters, a pre-processing of the text document needs to be carried out. The pre-processing stage identifies the normal text regions and inverted text regions, and the inverted text regions are toggled to the normal mode. Subsequently, to initiate column separation, a new strategy of incremental assimilation of white space runs in the vertical direction and the auto-estimation of certain related parameters is proposed. A procedure to realize column segmentation employing these extracted parameters has been devised. What follows first is a two-level horizontal row separation process, which segments every column into paragraphs, and in turn, into text lines. Then, there is a two-level vertical column separation process, which completes the separation into words and characters.

13.Motion Prediction in Visual Object Tracking

Visual object tracking (VOT) is an essential component for many applications, such as autonomous driving or assistive robotics. However, recent works tend to develop accurate systems based on more computationally expensive feature extractors for better instance matching. In contrast, this work addresses the importance of motion prediction in VOT. We use an off-the-shelf object detector to obtain instance bounding boxes. Then, a combination of camera motion decoupling and a Kalman filter is used for state estimation. Although our baseline system is a straightforward combination of standard methods, we obtain state-of-the-art results. Our method establishes new state-of-the-art performance on VOT (VOT-2016 and VOT-2018). Our proposed method improves the EAO on VOT-2016 from 0.472 of prior art to 0.505, and from 0.410 to 0.431 on VOT-2018. To show the generalizability, we also test our method on video object segmentation (VOS: DAVIS-2016 and DAVIS-2017) and observe consistent improvement.
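
For readers unfamiliar with the state-estimation step, here is a minimal sketch of a constant-velocity Kalman filter applied to the centre of a detected bounding box. The state layout, the noise variances, and the class name are assumptions chosen for illustration; the abstract does not give the paper's filter design, and camera motion decoupling is not modelled here.

```python
import numpy as np

class ConstantVelocityKalman:
    """Kalman filter over the state [cx, cy, vx, vy] with position measurements."""

    def __init__(self, cx, cy, process_var=1e-2, meas_var=1.0):
        self.x = np.array([cx, cy, 0.0, 0.0])
        self.P = np.eye(4) * 10.0              # initial state uncertainty
        self.F = np.eye(4)
        self.F[0, 2] = self.F[1, 3] = 1.0      # position += velocity each frame
        self.H = np.eye(2, 4)                  # only the box centre is observed
        self.Q = np.eye(4) * process_var
        self.R = np.eye(2) * meas_var

    def predict(self):
        """Advance the state one frame and return the predicted centre."""
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x[:2]

    def update(self, z):
        """Correct the state with a measured box centre z = (cx, cy)."""
        y = np.asarray(z, dtype=float) - self.H @ self.x
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.x = self.x + K @ y
        self.P = (np.eye(4) - K @ self.H) @ self.P

kf = ConstantVelocityKalman(0.0, 0.0)
for t in range(1, 21):
    kf.predict()
    kf.update([2.0 * t, 1.0 * t])  # hypothetical detections moving at (2, 1) px/frame
next_center = kf.predict()
```

After a few frames of consistent detections the filter's velocity estimate settles, so the prediction for the next frame lands close to where the object will actually be even if a detection is missed.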

14.Estimating Blink Probability for Highlight Detection in Figure Skating Videos

Highlight detection in sports videos has a broad viewership and huge commercial potential. It is thus imperative to detect highlight scenes more suitably for human interest with high temporal accuracy. Since people instinctively suppress blinks during attention-grabbing events and synchronously generate blinks at attention break points in videos, the instantaneous blink rate can be utilized as a highly accurate temporal indicator of human interest. Therefore, in this study, we propose a novel, automatic highlight detection method based on the blink rate. The method trains a one-dimensional convolution network (1D-CNN) to assess blink rates at each video frame from the spatio-temporal pose features of figure skating videos. Experiments show that the method successfully estimates the blink rate in 94% of the video clips and predicts the temporal change in the blink rate around a jump event with high accuracy. Moreover, the method detects not only the representative athletic action, but also the distinctive artistic expression of figure skating performance as key frames. This suggests that the blink-rate-based supervised learning approach enables high-accuracy highlight detection that more closely matches human sensibility.

15.Mining and Tailings Dam Detection In Satellite Imagery Using Deep Learning

This work explores the combination of free cloud computing, free open-source software, and deep learning methods to analyse a real, large-scale problem: the automatic country-wide identification and classification of surface mines and mining tailings dams in Brazil. Locations of officially registered mines and dams were obtained from the Brazilian government open data resource. Multispectral Sentinel-2 satellite imagery, obtained and processed at the Google Earth Engine platform, was used to train and test deep neural networks using the TensorFlow 2 API and Google Colab platform. Fully Convolutional Neural Networks were used in an innovative way, to search for unregistered ore mines and tailing dams in large areas of the Brazilian territory. The efficacy of the approach is demonstrated by the discovery of 263 mines that do not have an official mining concession. This exploratory work highlights the potential of a set of new technologies, freely available, for the construction of low cost data science tools that have high social impact. At the same time, it discusses and seeks to suggest practical solutions for the complex and serious problem of illegal mining and the proliferation of tailings dams, which pose high risks to the population and the environment, especially in developing countries. Code is made publicly available at: this https URL.

16.Attention-Oriented Action Recognition for Real-Time Human-Robot Interaction

Despite the notable progress made in action recognition tasks, not much work has been done in action recognition specifically for human-robot interaction. In this paper, we deeply explore the characteristics of the action recognition task in interaction scenarios and propose an attention-oriented multi-level network framework to meet the need for real-time interaction. Specifically, a Pre-Attention network is employed first to roughly focus on the interactor in the scene at low resolution and then to perform fine-grained pose estimation at high resolution. The other compact CNN receives the extracted skeleton sequence as input for action recognition, utilizing attention-like mechanisms to capture local spatial-temporal patterns and global semantic information effectively. To evaluate our approach, we construct a new action dataset specifically for the recognition task in interaction scenarios. Experimental results on our dataset and high efficiency (112 fps at 640 x 480 RGBD) on the mobile computing platform (Nvidia Jetson AGX Xavier) demonstrate excellent applicability of our method on action recognition in real-time human-robot interaction.

17.Are there any 'object detectors' in the hidden layers of CNNs trained to identify objects or scenes?

Various methods of measuring unit selectivity have been developed with the aim of better understanding how neural networks work. But the different measures provide divergent estimates of selectivity, and this has led to different conclusions regarding the conditions in which selective object representations are learned and the functional relevance of these representations. In an attempt to better characterize object selectivity, we undertake a comparison of various selectivity measures on a large set of units in AlexNet, including localist selectivity, precision, class-conditional mean activity selectivity (CCMAS), network dissection, the human interpretation of activation maximization (AM) images, and standard signal-detection measures. We find that the different measures provide different estimates of object selectivity, with precision and CCMAS measures providing misleadingly high estimates. Indeed, the most selective units had a poor hit-rate or a high false-alarm rate (or both) in object classification, making them poor object detectors. We fail to find any units that are even remotely as selective as the 'grandmother cell' units reported in recurrent neural networks. In order to generalize these results, we compared selectivity measures on units in VGG-16 and GoogLeNet trained on the ImageNet or Places-365 datasets that have been described as 'object detectors'. Again, we find poor hit-rates and high false-alarm rates for object classification. We conclude that signal-detection measures provide a better assessment of single-unit selectivity compared to common alternative approaches, and that deep convolutional networks of image classification do not learn object detectors in their hidden layers.
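
To make the contrast between precision and signal-detection measures concrete, the sketch below scores one hypothetical unit by thresholding its activation. The threshold rule, the function name, and the toy activations are assumptions; the paper's exact definitions of each measure may differ.

```python
import numpy as np

def unit_detection_stats(activations, is_target, threshold):
    """Score a unit that "detects" the target class when activation > threshold.

    Assumes both target and non-target images are present in the input.
    """
    activations = np.asarray(activations, dtype=float)
    is_target = np.asarray(is_target, dtype=bool)
    fired = activations > threshold
    hit_rate = fired[is_target].mean()           # fraction of targets detected
    false_alarm_rate = fired[~is_target].mean()  # fraction of non-targets flagged
    if fired.any():
        precision = (fired & is_target).sum() / fired.sum()
    else:
        precision = float("nan")
    return hit_rate, false_alarm_rate, precision

# Hypothetical unit: four target images followed by four non-target images.
acts = [0.9, 0.1, 0.2, 0.0, 0.05, 0.0, 0.1, 0.3]
labels = [1, 1, 1, 1, 0, 0, 0, 0]
hit, fa, precision = unit_detection_stats(acts, labels, threshold=0.5)
```

In this toy case the unit fires on a single target image and nothing else, so its precision is perfect while it misses three of the four targets, which is the kind of mismatch the abstract describes.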

18.Unsupervised Landmark Learning from Unpaired Data

Recent attempts at unsupervised landmark learning leverage synthesized image pairs that are similar in appearance but different in poses. These methods learn landmarks by encouraging the consistency between the original images and the images reconstructed from swapped appearances and poses. While synthesized image pairs are created by applying pre-defined transformations, they cannot fully reflect the real variances in both appearances and poses. In this paper, we aim to open the possibility of learning landmarks on unpaired data (i.e. unaligned image pairs) sampled from a natural image collection, so that they can be different in both appearances and poses. To this end, we propose a cross-image cycle consistency framework ($C^3$) which applies the swapping-reconstruction strategy twice to obtain the final supervision. Moreover, a cross-image flow module is further introduced to impose the equivariance between estimated landmarks across images. Through comprehensive experiments, our proposed framework is shown to outperform strong baselines by a large margin. Besides quantitative results, we also provide visualization and interpretation of our learned models, which not only verify the effectiveness of the learned landmarks, but also lead to important insights that are beneficial for future research.

19.ReXNet: Diminishing Representational Bottleneck on Convolutional Neural Network

This paper addresses the representational bottleneck in a network and proposes a set of design principles that improve model performance significantly. We argue that a representational bottleneck may occur in a network designed in a conventional way and degrades the model performance. To investigate the representational bottleneck, we study the matrix rank of the features generated by ten thousand random networks. We further study the entire layer's channel configuration towards designing more accurate network architectures. Based on the investigation, we propose simple yet effective design principles to mitigate the representational bottleneck. Slight changes to baseline networks that follow the principles lead to remarkable performance improvements on ImageNet classification. Additionally, COCO object detection results and transfer learning results on several datasets provide further support for the link between diminishing the representational bottleneck of a network and improving performance. Code and pretrained models are available at this https URL.
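
The idea of measuring feature rank can be illustrated with a single random layer. This is only a toy demonstration of the underlying linear-algebra fact (a linear layer cannot produce features whose rank exceeds its input dimension), not the paper's experimental protocol; the layer sizes and the use of ReLU are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, n_samples = 16, 64, 256

inputs = rng.standard_normal((n_samples, d_in))
weight = rng.standard_normal((d_in, d_out))

linear_features = inputs @ weight                  # expands 16 -> 64 channels
relu_features = np.maximum(linear_features, 0.0)   # same layer followed by ReLU

# Rank of the (n_samples, d_out) feature matrices.
linear_rank = np.linalg.matrix_rank(linear_features)
relu_rank = np.linalg.matrix_rank(relu_features)
```

The linear features are capped at rank 16 even though the layer outputs 64 channels, while a nonlinearity can lift the rank above that cap; comparing such ranks across layer configurations is one way to look for a bottleneck.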

20.RGB-D-based Framework to Acquire, Visualize and Measure the Human Body for Dietetic Treatments

This research aims to improve dietetic-nutritional treatment using state-of-the-art RGB-D sensors and virtual reality (VR) technology. Recent studies show that adherence to treatment can be improved using multimedia technologies. However, there are few studies using 3D data and VR technologies for this purpose. On the other hand, obtaining 3D measurements of the human body and analyzing them over time (4D) in patients undergoing dietary treatment is a challenging field. The main contribution of the work is to provide a framework to study the effect of 4D body model visualization on adherence to obesity treatment. The system can obtain a complete 3D model of a body using low-cost technology, allowing future straightforward transference with sufficient accuracy and realistic visualization, enabling the analysis of the evolution (4D) of the shape during the treatment of obesity. The 3D body models will be used for studying the effect of visualization on adherence to obesity treatment using 2D and VR devices. Moreover, we will use the acquired 3D models to obtain measurements of the body. An analysis of the accuracy of the proposed methods for obtaining measurements with both synthetic and real objects has been carried out.

21.PerceptionGAN: Real-world Image Construction from Provided Text through Perceptual Understanding

Generating an image from a provided descriptive text is quite a challenging task because of the difficulty in incorporating perceptual information (object shapes, colors, and their interactions) along with providing high relevancy related to the provided text. Current methods first generate an initial low-resolution image, which typically has irregular object shapes, colors, and interaction between objects. This initial image is then improved by conditioning on the text. However, these methods mainly address the problem of using text representation efficiently in the refinement of the initially generated image, while the success of this refinement process depends heavily on the quality of the initially generated image, as pointed out in the DM-GAN paper. Hence, we propose a method to provide good initialized images by incorporating perceptual understanding in the discriminator module. We improve the perceptual information at the first stage itself, which results in significant improvement in the final generated image. In this paper, we have applied our approach to the novel StackGAN architecture. We then show that the perceptual information included in the initial image is improved while modeling image distribution at multiple stages. Finally, we generated realistic multi-colored images conditioned by text. These images have good quality along with containing improved basic perceptual information. More importantly, the proposed method can be integrated into the pipeline of other state-of-the-art text-based-image-generation models to generate initial low-resolution images. We also worked on improving the refinement process in StackGAN by augmenting the third stage of the generator-discriminator pair in the StackGAN architecture. Our experimental analysis and comparison with the state-of-the-art on a large but sparse dataset MS COCO further validate the usefulness of our proposed approach.

22.A deep primal-dual proximal network for image restoration

Image restoration remains a challenging task in image processing. Numerous methods have been proposed to tackle this problem, which is often solved by minimizing a non-smooth penalized likelihood function. Although the solution is easily interpretable with theoretical guarantees, its estimation relies on an optimization process. Deep learning, the subject of considerable research effort in image classification, offers an alternative for performing image restoration, but its adaptation to inverse problems is still challenging. In this work, we design a deep network, named DeepPDNet, built from primal-dual proximal iterations associated with the minimization of a standard penalized likelihood with an analysis prior, allowing us to take advantage of both worlds.
We reformulate a specific instance of the Condat-Vu primal-dual hybrid gradient (PDHG) algorithm as a deep network with fixed layers. Each layer corresponds to one iteration of the primal-dual algorithm. The learned parameters are the primal-dual proximal algorithm step-size and the analysis linear operator involved in the penalization. These parameters are allowed to vary from one layer to another. Two different learning strategies, "Full learning" and "Partial learning", are proposed: the first is the most efficient numerically, while the second relies on standard constraints ensuring convergence in the standard PDHG iterations. Moreover, global and local sparse analysis priors are studied to seek a better feature representation. We evaluate the proposed DeepPDNet on the MNIST and BSD68 datasets with different blur and additive Gaussian noise. Extensive results show that the proposed deep primal-dual proximal networks demonstrate excellent performance on the MNIST dataset compared to other state-of-the-art methods and better or at least comparable performance on the more complex BSD68 dataset.

23.The Impact of Explanations on AI Competency Prediction in VQA

Explainability is one of the key elements for building trust in AI systems. Among numerous attempts to make AI explainable, quantifying the effect of explanations remains a challenge in conducting human-AI collaborative tasks. Aside from the ability to predict the overall behavior of AI, in many applications, users need to understand an AI agent's competency in different aspects of the task domain. In this paper, we evaluate the impact of explanations on the user's mental model of AI agent competency within the task of visual question answering (VQA). We quantify users' understanding of competency, based on the correlation between the actual system performance and user rankings. We introduce an explainable VQA system that uses spatial and object features and is powered by the BERT language model. Each group of users sees only one kind of explanation to rank the competencies of the VQA model. The proposed model is evaluated through between-subject experiments to probe explanations' impact on the user's perception of competency. The comparison between two VQA models shows BERT based explanations and the use of object features improve the user's prediction of the model's competencies.

24.ACFD: Asymmetric Cartoon Face Detector ⬇️

Cartoon face detection is a more challenging task than human face detection because many difficult scenarios are involved. Targeting the characteristics of cartoon faces, such as huge intra-face differences, in this paper we propose an asymmetric cartoon face detector, named ACFD. Specifically, it consists of the following modules: a novel backbone VoVNetV3 comprised of several asymmetric one-shot aggregation modules (AOSA), an asymmetric bi-directional feature pyramid network (ABi-FPN), a dynamic anchor match strategy (DAM) and the corresponding margin binary classification loss (MBC). In particular, to generate features with diverse receptive fields, multi-scale pyramid features are extracted by VoVNetV3, and then fused and enhanced simultaneously by ABi-FPN to handle faces in extreme poses and with disparate aspect ratios. Besides, DAM is used to match enough high-quality anchors for each face, and MBC provides strong discriminative power. With the effectiveness of these modules, our ACFD achieves 1st place on the detection track of the 2020 iCartoon Face Challenge under the constraints of model size 200MB, inference time 50ms per image, and without any pretrained models.

25.Image Analysis Based on Nonnegative/Binary Matrix Factorization ⬇️

Using nonnegative/binary matrix factorization (NBMF), a matrix can be decomposed into a nonnegative matrix and a binary matrix. Our analysis of facial images, based on NBMF and using the Fujitsu Digital Annealer, leads to successful image reconstruction and image classification. The NBMF algorithm converges in fewer iterations than those required for the convergence of nonnegative matrix factorization (NMF), although both techniques perform comparably in image classification.

26.Noticing Motion Patterns: Temporal CNN with a Novel Convolution Operator for Human Trajectory Prediction ⬇️

We propose a novel way to learn, detect and extract patterns in sequential data, and successfully apply it to the problem of human trajectory prediction. Our model, Social Pattern Extraction Convolution (Social-PEC), when compared to existing methods, achieves the best performance in terms of Average/Final Displacement Error. In addition, the proposed approach avoids the obscurity in the previous use of pooling layers, presenting intuitive and explainable decision-making processes.
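
For readers unfamiliar with the Average/Final Displacement Error metrics mentioned above, here is a minimal sketch of how they are commonly computed for a single trajectory. The function name `displacement_errors` and the 2D point representation are assumptions for illustration; the abstract does not describe the evaluation code.

```python
import math


def displacement_errors(predicted, ground_truth):
    """Return (ADE, FDE) for one trajectory.

    ADE is the mean Euclidean distance over all predicted time steps;
    FDE is the Euclidean distance at the final time step.
    """
    if not predicted or len(predicted) != len(ground_truth):
        raise ValueError("trajectories must be non-empty and equally long")
    distances = [math.dist(p, g) for p, g in zip(predicted, ground_truth)]
    ade = sum(distances) / len(distances)
    fde = distances[-1]
    return ade, fde
```

In practice these values are usually averaged over all pedestrians and scenes in a benchmark.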

27.MSA-MIL: A deep residual multiple instance learning model based on multi-scale annotation for classification and visualization of glomerular spikes ⬇️

Membranous nephropathy (MN) is a frequent type of adult nephrotic syndrome, which has a high clinical incidence and can cause various complications. In biopsy microscope slides of membranous nephropathy, spikelike projections on the glomerular basement membrane are a prominent feature of MN. However, because a whole biopsy slide contains a large number of glomeruli, and each glomerulus includes many spike lesions, the pathological feature of the spikes is not obvious. It is thus time-consuming for doctors to diagnose glomeruli one by one, and difficult for less experienced pathologists to diagnose them. In this paper, we establish a visualized classification model based on multi-scale annotation multi-instance learning (MSA-MIL) to achieve glomerular classification and spike visualization. The MSA-MIL model mainly involves three parts. Firstly, U-Net is used to extract the region of the glomeruli to ensure that the features learned by the succeeding algorithm are focused inside the glomeruli themselves. Secondly, we use MIL to train an instance-level classifier combined with the MSA method to enhance the learning ability of the network by adding a location-level labeled reinforced dataset, thereby obtaining an example-level feature representation with rich semantics. Lastly, the predicted scores of each tile in the image are summarized to obtain the glomerular classification and to visualize the classification results of the spikes using a sliding window method. The experimental results confirm that the proposed MSA-MIL model can effectively and accurately classify normal and spiked glomeruli and visualize the position of spikes in the glomerulus. Therefore, the proposed model can provide a good foundation for assisting clinical doctors in diagnosing glomerular membranous nephropathy.

28.Low-light Environment Neural Surveillance ⬇️

We design and implement an end-to-end system for real-time crime detection in low-light environments. Unlike Closed-Circuit Television, which performs reactively, the Low-Light Environment Neural Surveillance provides real-time crime alerts. The system uses a low-light video feed processed in real-time by an optical-flow network, spatial and temporal networks, and a Support Vector Machine to identify shootings, assaults, and thefts. We create a low-light action-recognition dataset, LENS-4, which will be publicly available. An IoT infrastructure set up via Amazon Web Services interprets messages from the local board hosting the camera for action recognition and parses the results in the cloud to relay messages. The system achieves 71.5% accuracy at 20 FPS. The user interface is a mobile app which allows local authorities to receive notifications and to view a video of the crime scene. Citizens have a public app which enables law enforcement to push crime alerts based on user proximity.

29.Understanding Road Layout from Videos as a Whole ⬇️

In this paper, we address the problem of inferring the layout of complex road scenes from video sequences. To this end, we formulate it as a top-view road attributes prediction problem and our goal is to predict these attributes for each frame both accurately and consistently. In contrast to prior work, we exploit the following three novel aspects: leveraging camera motions in videos, including context cues, and incorporating long-term video information. Specifically, we introduce a model that aims to enforce prediction consistency in videos. Our model consists of one LSTM and one Feature Transform Module (FTM). The former implicitly incorporates the consistency constraint with its hidden states, and the latter explicitly takes the camera motion into consideration when aggregating information along videos. Moreover, we propose to incorporate context information by introducing road participants, e.g. objects, into our model. When the entire video sequence is available, our model is also able to encode both local and global cues, e.g. information from both past and future frames. Experiments on two data sets show that: (1) Incorporating either global or contextual cues improves the prediction accuracy and leveraging both gives the best performance. (2) Introducing the LSTM and FTM modules improves the prediction consistency in videos. (3) The proposed method outperforms the SOTA by a large margin.

30.Query-Free Adversarial Transfer via Undertrained Surrogates ⬇️

Deep neural networks have been shown to be highly vulnerable to adversarial examples---minor perturbations added to a model's input which cause the model to output an incorrect prediction. This vulnerability represents both a risk for the use of deep learning models in security-conscious fields and an opportunity to improve our understanding of how deep networks generalize to unexpected inputs. In a transfer attack, the adversary builds an adversarial attack using a surrogate model, then uses that attack to fool an unseen target model. Recent work in this subfield has focused on attack generation methods which can improve transferability between models. We show that optimizing a single surrogate model is a more effective method of improving adversarial transfer, using the simple example of an undertrained surrogate. This method transfers well across varied architectures and outperforms state-of-the-art methods. To interpret the effectiveness of undertrained surrogate models, we represent adversarial transferability as a function of surrogate model loss function curvature and similarity between surrogate and target gradients and show that our approach reduces the presence of local loss maxima which hinder transferability. Our results suggest that finding good single surrogate models is a highly effective and simple method for generating transferable adversarial attacks, and that this method represents a valuable route for future study in this field.

31.TiledSoilingNet: Tile-level Soiling Detection on Automotive Surround-view Cameras Using Coverage Metric ⬇️

Automotive cameras, particularly surround-view cameras, tend to get soiled by mud, water, snow, etc. For higher levels of autonomous driving, it is necessary to have a soiling detection algorithm which will trigger an automatic cleaning system. Localized detection of soiling in an image is necessary to control the cleaning system. It is also necessary to enable partial functionality in unsoiled areas while reducing confidence in soiled areas. Although this can be solved using a semantic segmentation task, we explore a more efficient solution targeting deployment in low-power embedded systems. We propose a novel method to regress the area of each soiling type within a tile directly. We refer to this as coverage. The proposed approach is better than learning the dominant class in a tile, as multiple soiling types commonly occur within a tile. It also has the advantage of dealing with coarse polygon annotation, which will cause the segmentation task. The proposed soiling coverage decoder is an order of magnitude faster than an equivalent segmentation decoder. We also integrated it into an object detection and semantic segmentation multi-task model using an asynchronous back-propagation algorithm. A portion of the dataset used will be released publicly as part of our WoodScape dataset to encourage further research.
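
To make the notion of per-tile coverage concrete, the sketch below derives coverage targets from a pixel-level class mask: for each tile, the fraction of its area occupied by each class. This is only one plausible way to build such regression targets; the function name `tile_coverage`, the list-of-lists mask, and the integer class ids are assumptions, not details taken from the paper.

```python
def tile_coverage(mask, tile_h, tile_w, num_classes):
    """Map (tile_row, tile_col) to the per-class area fractions in that tile.

    mask is a 2D list of integer class ids in range(num_classes) whose
    height and width are multiples of the tile size.
    """
    rows, cols = len(mask), len(mask[0])
    if rows % tile_h or cols % tile_w:
        raise ValueError("mask size must be a multiple of the tile size")
    area = tile_h * tile_w
    coverage = {}
    for top in range(0, rows, tile_h):
        for left in range(0, cols, tile_w):
            counts = [0] * num_classes
            # Count how many pixels of each class fall inside this tile.
            for r in range(top, top + tile_h):
                for c in range(left, left + tile_w):
                    counts[mask[r][c]] += 1
            coverage[(top // tile_h, left // tile_w)] = [n / area for n in counts]
    return coverage
```

Unlike a dominant-class label, the resulting vector keeps the information that, say, a tile is three-quarters clean and one-quarter covered by another class.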

32.Learning Surrogates via Deep Embedding ⬇️

This paper proposes a technique for training a neural network by minimizing a surrogate loss that approximates the target evaluation metric, which may be non-differentiable. The surrogate is learned via a deep embedding where the Euclidean distance between the prediction and the ground truth corresponds to the value of the evaluation metric. The effectiveness of the proposed technique is demonstrated in a post-tuning setup, where a trained model is tuned using the learned surrogate. Without a significant computational overhead and any bells and whistles, improvements are demonstrated on challenging and practical tasks of scene-text recognition and detection. In the recognition task, the model is tuned using a surrogate approximating the edit distance metric and achieves up to $39\%$ relative improvement in the total edit distance. In the detection task, the surrogate approximates the intersection over union metric for rotated bounding boxes and yields up to $4.25\%$ relative improvement in the $F_{1}$ score.
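
The edit distance that the recognition surrogate approximates is non-differentiable, which is why a learned surrogate is needed. As a reference point, the sketch below shows the standard Levenshtein edit distance and one way to express a relative improvement in a total edit distance. The function names are assumptions for this example, and the learned embedding itself is not shown.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def relative_improvement(baseline_total, tuned_total):
    """Fractional reduction of a lower-is-better metric."""
    return (baseline_total - tuned_total) / baseline_total
```

For example, a total edit distance that drops from 100 to 61 (hypothetical numbers) corresponds to a relative improvement of 0.39.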

33.Age-Oriented Face Synthesis with Conditional Discriminator Pool and Adversarial Triplet Loss ⬇️

The vanilla Generative Adversarial Networks (GAN) are commonly used to generate realistic images depicting aged and rejuvenated faces. However, the performance of such vanilla GANs in the age-oriented face synthesis task is often compromised by the mode collapse issue, which may result in the generation of faces with minimal variations and a poor synthesis accuracy. In addition, recent age-oriented face synthesis methods use the L1 or L2 constraint to preserve the identity information on synthesized faces, which implicitly limits the identity permanence capabilities when these constraints are associated with a trivial weighting factor. In this paper, we propose a method for the age-oriented face synthesis task that achieves a high synthesis accuracy with strong identity permanence capabilities. Specifically, to achieve a high synthesis accuracy, our method tackles the mode collapse issue with a novel Conditional Discriminator Pool (CDP), which consists of multiple discriminators, each targeting one particular age category. To achieve strong identity permanence capabilities, our method uses a novel Adversarial Triplet loss. This loss, which is based on the Triplet loss, adds a ranking operation to further pull the positive embedding towards the anchor embedding resulting in significantly reduced intra-class variances in the feature space. Through extensive experiments, we show that our proposed method outperforms state-of-the-art methods in terms of synthesis accuracy and identity permanence capabilities, qualitatively and quantitatively.
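
For context, the standard Triplet loss that the Adversarial Triplet loss builds on can be sketched as below. This is only the baseline formulation under assumed choices (Euclidean distance, a hypothetical default margin); the ranking operation that the paper adds is not described in enough detail in the abstract to reproduce here.

```python
import math


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Baseline triplet loss on embedding vectors.

    The loss is zero once the negative is at least `margin` farther from
    the anchor than the positive is. Some formulations use squared
    distances instead; plain Euclidean distance is assumed here.
    """
    distance_positive = math.dist(anchor, positive)
    distance_negative = math.dist(anchor, negative)
    return max(0.0, distance_positive - distance_negative + margin)
```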

34.Self-supervised Deep Reconstruction of Mixed Strip-shredded Text Documents ⬇️

The reconstruction of shredded documents consists of coherently arranging fragments of paper (shreds) to recover the original document(s). A great challenge in computational reconstruction is to properly evaluate the compatibility between the shreds. While traditional pixel-based approaches are not robust to real shredding, more sophisticated solutions significantly compromise time performance. The solution presented in this work extends our previous deep learning method for single-page reconstruction to a more realistic/complex scenario: the reconstruction of several mixed shredded documents at once. In our approach, the compatibility evaluation is modeled as a two-class (valid or invalid) pattern recognition problem. The model is trained in a self-supervised manner on samples extracted from simulated-shredded documents, which obviates manual annotation. Experimental results on three datasets -- including a new collection of 100 strip-shredded documents produced for this work -- have shown that the proposed method outperforms the competing ones on complex scenarios, achieving accuracy superior to 90%.

35.Weakly-Supervised Segmentation for Disease Localization in Chest X-Ray Images ⬇️

Deep Convolutional Neural Networks have proven effective in solving the task of semantic segmentation. However, their efficiency heavily relies on pixel-level annotations that are expensive to get and often require domain expertise, especially in medical imaging. Weakly supervised semantic segmentation helps to overcome these issues and also provides explainable deep learning models. In this paper, we propose a novel approach to the semantic segmentation of medical chest X-ray images with only image-level class labels as supervision. We improve the disease localization accuracy by combining three approaches as consecutive steps. First, we generate pseudo segmentation labels of abnormal regions in the training images through a supervised classification model enhanced with a regularization procedure. The obtained activation maps are then post-processed and propagated into a second classification model, the Inter-pixel Relation Network, which improves the boundaries between different object classes. Finally, the resulting pseudo-labels are used to train a proposed fully supervised segmentation model. We analyze the robustness of the presented method and test its performance on two distinct datasets: PASCAL VOC 2012 and SIIM-ACR Pneumothorax. We achieve significant results in the segmentation on both datasets using only image-level annotations. We show that this approach is applicable to chest X-rays for detecting an anomalous volume of air in the pleural space between the lung and the chest wall. Our code has been made publicly available.

36.Virtual Testbed for Monocular Visual Navigation of Small Unmanned Aircraft Systems ⬇️

Monocular visual navigation methods have seen significant advances in the last decade, recently producing several real-time solutions for autonomously navigating small unmanned aircraft systems without relying on GPS. This is critical for military operations which may involve environments where GPS signals are degraded or denied. However, testing and comparing visual navigation algorithms remains a challenge since visual data is expensive to gather. Conducting flight tests in a virtual environment is an attractive solution prior to committing to outdoor testing.
This work presents a virtual testbed for conducting simulated flight tests over real-world terrain and analyzing the real-time performance of visual navigation algorithms at 31 Hz. This tool was created to ultimately find a visual odometry algorithm appropriate for further GPS-denied navigation research on fixed-wing aircraft, even though all of the algorithms were designed for other modalities. This testbed was used to evaluate three current state-of-the-art, open-source monocular visual odometry algorithms on a fixed-wing platform: Direct Sparse Odometry, Semi-Direct Visual Odometry, and ORB-SLAM2 (with loop closures disabled).

37.Learning Geocentric Object Pose in Oblique Monocular Images ⬇️

An object's geocentric pose, defined as the height above ground and orientation with respect to gravity, is a powerful representation of real-world structure for object detection, segmentation, and localization tasks using RGBD images. For close-range vision tasks, height and orientation have been derived directly from stereo-computed depth and more recently from monocular depth predicted by deep networks. For long-range vision tasks such as Earth observation, depth cannot be reliably estimated with monocular images. Inspired by recent work in monocular height above ground prediction and optical flow prediction from static images, we develop an encoding of geocentric pose to address this challenge and train a deep network to compute the representation densely, supervised by publicly available airborne lidar. We exploit these attributes to rectify oblique images and remove observed object parallax to dramatically improve the accuracy of localization and to enable accurate alignment of multiple images taken from very different oblique viewpoints. We demonstrate the value of our approach by extending two large-scale public datasets for semantic segmentation in oblique satellite images. All of our data and code are publicly available.

38.ConFoc: Content-Focus Protection Against Trojan Attacks on Neural Networks ⬇️

Deep Neural Networks (DNNs) have been applied successfully in computer vision. However, their wide adoption in image-related applications is threatened by their vulnerability to trojan attacks. These attacks insert some misbehavior at training using samples with a mark or trigger, which is exploited at inference or testing time. In this work, we analyze the composition of the features learned by DNNs at training. We identify that they, including those related to the inserted triggers, contain both content (semantic information) and style (texture information), which are recognized as a whole by DNNs at testing time. We then propose a novel defensive technique against trojan attacks, in which DNNs are taught to disregard the styles of inputs and focus on their content only to mitigate the effect of triggers during the classification. The generic applicability of the approach is demonstrated in the context of a traffic sign and a face recognition application. Each of them is exposed to a different attack with a variety of triggers. Results show that the method reduces the attack success rate significantly to values < 1% in all the tested attacks while keeping as well as improving the initial accuracy of the models when processing both benign and adversarial data.

39.Not All Unlabeled Data are Equal: Learning to Weight Data in Semi-supervised Learning ⬇️

Existing semi-supervised learning (SSL) algorithms use a single weight to balance the loss of labeled and unlabeled examples, i.e., all unlabeled examples are equally weighted. But not all unlabeled data are equal. In this paper we study how to use a different weight for every unlabeled example. Manual tuning of all those weights -- as done in prior work -- is no longer possible. Instead, we adjust those weights via an algorithm based on the influence function, a measure of a model's dependency on one training example. To make the approach efficient, we propose a fast and effective approximation of the influence function. We demonstrate that this technique outperforms state-of-the-art methods on semi-supervised image and language classification tasks.
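
The difference between a single balancing weight and per-example weights can be sketched as follows. The normalisation (mean over each set) and the function name `ssl_loss` are assumptions for illustration; the influence-function procedure that adjusts the weights is not shown.

```python
def ssl_loss(labeled_losses, unlabeled_losses, weights):
    """Supervised mean loss plus a per-example weighted unlabeled loss.

    weights[i] scales the loss of the i-th unlabeled example. Passing the
    same value for every weight recovers the usual single-weight setup.
    """
    if len(unlabeled_losses) != len(weights):
        raise ValueError("need exactly one weight per unlabeled example")
    supervised = sum(labeled_losses) / len(labeled_losses)
    unsupervised = sum(
        w * loss for w, loss in zip(weights, unlabeled_losses)
    ) / len(unlabeled_losses)
    return supervised + unsupervised
```

With weights `[0.5, 0.5]` both unlabeled examples contribute equally, whereas `[1.0, 0.0]` ignores the second example entirely, which is the kind of distinction a single global weight cannot express.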

40.Image Processing and Quality Control for Abdominal Magnetic Resonance Imaging in the UK Biobank ⬇️

An end-to-end image analysis pipeline is presented for the abdominal MRI protocol used in the UK Biobank on the first 38,971 participants. Emphasis is on the processing steps necessary to ensure a high level of data quality and consistency in order to prepare the datasets for downstream quantitative analysis, such as segmentation and parameter estimation. Quality control procedures have been incorporated to detect and, where possible, correct issues in the raw data. Detection of fat-water swaps in the Dixon series is performed by a deep learning model and corrected automatically. Bone joints are predicted using a hybrid atlas-based registration and deep learning model for the shoulders, hips and knees. Simultaneous estimation of proton density fat fraction and transverse relaxivity (R2*) is performed using both the magnitude and phase information for the single-slice multiecho series. Approximately 98.1% of the two-point Dixon acquisitions were successfully processed and passed quality control, with 99.98% of the high-resolution T1-weighted 3D volumes succeeding. Approximately 99.98% of the single-slice multiecho acquisitions covering the liver were successfully processed and passed quality control, with 97.6% of the single-slice multiecho acquisitions covering the pancreas succeeding. At least one fat-water swap was detected in 1.8% of participants. With respect to the bone joints, approximately 3.3% of participants were missing at least one knee joint and 0.8% were missing at least one shoulder joint. For the participants who received both single-slice multiecho acquisition protocols for the liver, a systematic difference between the two protocols was identified and modeled using multiple linear regression. The findings presented here will be invaluable for scientists who seek to use image-derived phenotypes from the abdominal MRI protocol.

41.Globally Optimal Surface Segmentation using Deep Learning with Learnable Smoothness Priors ⬇️

Automated surface segmentation is important and challenging in many medical image analysis applications. Recent deep learning based methods have been developed for various object segmentation tasks. Most of them take a classification-based approach, e.g. U-net, which predicts the probability of being target object or background for each voxel. One problem with those methods is the lack of a topology guarantee for segmented objects, and post-processing is usually needed to infer the boundary surface of the object. In this paper, a novel model based on a convolutional neural network (CNN) followed by a learnable surface smoothing block is proposed to tackle the surface segmentation problem with end-to-end training. To the best of our knowledge, this is the first study to learn smoothness priors end-to-end with a CNN for direct surface segmentation with global optimality. Experiments carried out on Spectral Domain Optical Coherence Tomography (SD-OCT) retinal layer segmentation and Intravascular Ultrasound (IVUS) vessel wall segmentation demonstrated very promising results.

42.Spot the conversation: speaker diarisation in the wild ⬇️

The goal of this paper is speaker diarisation of videos collected 'in the wild'. We make three key contributions. First, we propose an automatic audio-visual diarisation method for YouTube videos. Our method consists of active speaker detection using audio-visual methods and speaker verification using self-enrolled speaker models. Second, we integrate our method into a semi-automatic dataset creation pipeline which significantly reduces the number of hours required to annotate videos with diarisation labels. Finally, we use this pipeline to create a large-scale diarisation dataset called VoxConverse, collected from 'in the wild' videos, which we will release publicly to the research community. Our dataset consists of overlapping speech, a large and diverse speaker pool, and challenging background conditions.

43.Student-Teacher Curriculum Learning via Reinforcement Learning: Predicting Hospital Inpatient Admission Location ⬇️

Accurate and reliable prediction of hospital admission location is important due to resource-constraints and space availability in a clinical setting, particularly when dealing with patients who come from the emergency department. In this work we propose a student-teacher network via reinforcement learning to deal with this specific problem. A representation of the weights of the student network is treated as the state and is fed as an input to the teacher network. The teacher network's action is to select the most appropriate batch of data to train the student network on from a training set sorted according to entropy. By validating on three datasets, not only do we show that our approach outperforms state-of-the-art methods on tabular data and performs competitively on image recognition, but also that novel curricula are learned by the teacher network. We demonstrate experimentally that the teacher network can actively learn about the student network and guide it to achieve better performance than if trained alone.

44.A Brief Review of Deep Multi-task Learning and Auxiliary Task Learning ⬇️

Multi-task learning (MTL) optimizes several learning tasks simultaneously and leverages their shared information to improve generalization and the prediction of the model for each task. Auxiliary tasks can be added to the main task to ultimately boost the performance. In this paper, we provide a brief review on the recent deep multi-task learning (dMTL) approaches followed by methods on selecting useful auxiliary tasks that can be used in dMTL to improve the performance of the model for the main task.

45.Evaluation of Contemporary Convolutional Neural Network Architectures for Detecting COVID-19 from Chest Radiographs ⬇️

Interpreting chest radiograph, a.k.a. chest x-ray, images is a necessary and crucial diagnostic tool used by medical professionals to detect and identify many diseases that may plague a patient. Although the images themselves contain a wealth of valuable information, their usefulness may be limited by how well they are interpreted, especially when the reviewing radiologist may be fatigued or when an experienced radiologist is unavailable. Research in the use of deep learning models to analyze chest radiographs yielded impressive results where, in some instances, the models outperformed practicing radiologists. Amidst the COVID-19 pandemic, researchers have explored and proposed the use of said deep models to detect COVID-19 infections from radiographs as a possible way to help ease the strain on medical resources. In this study, we train and evaluate three model architectures, proposed for chest radiograph analysis, under varying conditions, find issues that discount the impressive model performances proposed by contemporary studies on this subject, and propose methodologies to train models that yield more reliable results. Code, scripts, pre-trained models, and visualizations are available at this https URL.

46.Scene Graph Reasoning for Visual Question Answering ⬇️

Visual question answering is concerned with answering free-form questions about an image. Since it requires a deep linguistic understanding of the question and the ability to associate it with various objects that are present in the image, it is an ambitious task and requires techniques from both computer vision and natural language processing. We propose a novel method that approaches the task by performing context-driven, sequential reasoning based on the objects and their semantic and spatial relationships present in the scene. As a first step, we derive a scene graph which describes the objects in the image, as well as their attributes and their mutual relationships. A reinforcement agent then learns to autonomously navigate over the extracted scene graph to generate paths, which are then the basis for deriving answers. We conduct a first experimental study on the challenging GQA dataset with manually curated scene graphs, where our method almost reaches the level of human performance.

47.4D Spatio-Temporal Convolutional Networks for Object Position Estimation in OCT Volumes ⬇️

Tracking and localizing objects is a central problem in computer-assisted surgery. Optical coherence tomography (OCT) can be employed as an optical tracking system, due to its high spatial and temporal resolution. Recently, 3D convolutional neural networks (CNNs) have shown promising performance for pose estimation of a marker object using single volumetric OCT images. While this approach relied on spatial information only, OCT allows for a temporal stream of OCT image volumes capturing the motion of an object at high volume rates. In this work, we systematically extend 3D CNNs to 4D spatio-temporal CNNs to evaluate the impact of additional temporal information for marker object tracking. Across various architectures, our results demonstrate that using a stream of OCT volumes and employing 4D spatio-temporal convolutions leads to a 30% lower mean absolute error compared to single volume processing with 3D CNNs.

48.Spectral-Spatial Recurrent-Convolutional Networks for In-Vivo Hyperspectral Tumor Type Classification ⬇️

Early detection of cancerous tissue is crucial for long-term patient survival. In the head and neck region, a typical diagnostic procedure is an endoscopic intervention where a medical expert manually assesses tissue using RGB camera images. While healthy and tumor regions are generally easier to distinguish, differentiating benign and malignant tumors is very challenging. This requires an invasive biopsy, followed by histological evaluation for diagnosis. Also, during tumor resection, tumor margins need to be verified by histological analysis. To avoid unnecessary tissue resection, a non-invasive, image-based diagnostic tool would be very valuable. Recently, hyperspectral imaging paired with deep learning has been proposed for this task, demonstrating promising results on ex-vivo specimens. In this work, we demonstrate the feasibility of in-vivo tumor type classification using hyperspectral imaging and deep learning. We analyze the value of using multiple hyperspectral bands compared to conventional RGB images and we study several machine learning models' ability to make use of the additional spectral information. Based on our insights, we address spectral and spatial processing using recurrent-convolutional models for effective spectral aggregation and spatial feature learning. Our best model achieves an AUC of 76.3%, significantly outperforming previous conventional and deep learning methods.

49.A Novel DNN Training Framework via Data Sampling and Multi-Task Optimization ⬇️

Conventional DNN training paradigms typically rely on one training set and one validation set, obtained by partitioning an annotated dataset used for training, namely the gross training set, in a certain way. The training set is used for training the model while the validation set is used to estimate the generalization performance of the trained model as the training proceeds to avoid over-fitting. There exist two major issues in this paradigm. Firstly, the validation set may hardly guarantee an unbiased estimate of generalization performance due to potential mismatching with test data. Secondly, training a DNN corresponds to solving a complex optimization problem, which is prone to getting trapped in inferior local optima and thus leads to undesired training results. To address these issues, we propose a novel DNN training framework. It generates multiple pairs of training and validation sets from the gross training set via random splitting, trains a DNN model of a pre-specified structure on each pair while allowing the useful knowledge (e.g., promising network parameters) obtained from one model training process to be transferred to other model training processes via multi-task optimization, and outputs the model, among all trained models, that has the overall best performance across the validation sets from all pairs. The knowledge transfer mechanism featured in this new framework can not only enhance training effectiveness by helping the model training process to escape from local optima but also improve generalization performance via implicit regularization imposed on one model training process from other model training processes. We implement the proposed framework, parallelize the implementation on a GPU cluster, and apply it to train several widely used DNN models. Experimental results demonstrate the superiority of the proposed framework over the conventional training paradigm.
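
The data-sampling and model-selection steps of such a framework can be sketched as below: several random training/validation pairs are drawn from the gross training set, and the model with the best mean score over all validation sets is kept. The knowledge transfer via multi-task optimization is deliberately omitted, and the names `random_splits`, `select_best`, `val_fraction`, and `score` are hypothetical.

```python
import random


def random_splits(items, num_pairs, val_fraction=0.2, seed=0):
    """Draw num_pairs independent (training, validation) partitions."""
    rng = random.Random(seed)
    n_val = max(1, int(len(items) * val_fraction))
    pairs = []
    for _ in range(num_pairs):
        shuffled = list(items)
        rng.shuffle(shuffled)
        # First n_val items form the validation set, the rest the training set.
        pairs.append((shuffled[n_val:], shuffled[:n_val]))
    return pairs


def select_best(models, validation_sets, score):
    """Return the model with the highest mean score over all validation sets.

    score(model, validation_set) is a caller-supplied, higher-is-better metric.
    """
    def mean_score(model):
        return sum(score(model, v) for v in validation_sets) / len(validation_sets)
    return max(models, key=mean_score)
```

Evaluating every candidate on the validation sets of all pairs, rather than only its own, is what lets the final choice reflect performance across the different splits.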

50.PGD-UNet: A Position-Guided Deformable Network for Simultaneous Segmentation of Organs and Tumors ⬇️

Precise segmentation of organs and tumors plays a crucial role in clinical applications. It is a challenging task due to the irregular shapes and various sizes of organs and tumors as well as the significant class imbalance between the anatomy of interest (AOI) and the background region. In addition, in most situations tumors and normal organs overlap in medical images, but current approaches fail to delineate both tumors and organs accurately. To tackle such challenges, we propose a position-guided deformable UNet, namely PGD-UNet, which exploits the spatial deformation capabilities of deformable convolution to deal with the geometric transformation of both organs and tumors. Position information is explicitly encoded into the network to enhance the capabilities of deformation. Meanwhile, we introduce a new pooling module to preserve position information lost in the conventional max-pooling operation. Besides, due to unclear boundaries between different structures as well as the subjectivity of annotations, labels are not necessarily accurate for medical image segmentation tasks. This label noise may cause the trained network to overfit. To address this issue, we formulate a novel loss function to suppress the influence of potential label noise on the training process. Our method was evaluated on two challenging segmentation tasks and achieved very promising segmentation accuracy in both tasks.

51.MPLP: Learning a Message Passing Learning Protocol ⬇️

We present a novel method for learning the weights of an artificial neural network - a Message Passing Learning Protocol (MPLP). In MPLP, we abstract every operation occurring in ANNs as an independent agent. Each agent is responsible for ingesting incoming multidimensional messages from other agents, updating its internal state, and generating multidimensional messages to be passed on to neighbouring agents. We demonstrate the viability of MPLP as opposed to traditional gradient-based approaches on simple feed-forward neural networks, and present a framework capable of generalizing to non-traditional neural network architectures. MPLP is meta learned using end-to-end gradient-based meta-optimisation. We further discuss the observed properties of MPLP and hypothesize its applicability to various fields of deep learning.

52.Iterative Bounding Box Annotation for Object Detection ⬇️

Manual annotation of bounding boxes for object detection in digital images is tedious, time-consuming, and resource-intensive. In this paper, we propose a semi-automatic method for efficient bounding box annotation. The method trains the object detector iteratively on small batches of labeled images and learns to propose bounding boxes for the next batch, after which the human annotator only needs to correct possible errors. We propose an experimental setup for simulating the human actions and use it to compare different iteration strategies, such as the order in which the data is presented to the annotator. We evaluate our method on three datasets and show that it can reduce the human annotation effort significantly, saving up to 75% of the total manual annotation work.
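
The iterate-propose-correct loop described in this abstract can be sketched as a small simulation. This is a toy under stated assumptions, not the paper's setup: each image has exactly one box, the "detector" (`train_mean_box`, `propose_mean_box`) simply proposes the mean of all boxes labeled so far, and the simulated annotator corrects a proposal whenever its IoU with the ground truth falls below a hypothetical threshold `iou_ok=0.5`.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def train_mean_box(labeled):
    """Toy 'detector training' (assumption): average the labeled boxes."""
    n = len(labeled)
    return tuple(sum(box[i] for box in labeled) / n for i in range(4))

def propose_mean_box(model):
    """Toy 'detector inference' (assumption): propose the stored mean box."""
    return model

def simulate_annotation(truth_boxes, batch_size, train, propose, iou_ok=0.5):
    """Return (labels, n_corrections) for a dataset with one box per image."""
    labeled, corrections, model = [], 0, None
    for start in range(0, len(truth_boxes), batch_size):
        for truth in truth_boxes[start:start + batch_size]:
            guess = propose(model) if model is not None else None
            if guess is None or iou(guess, truth) < iou_ok:
                corrections += 1   # simulated annotator draws or fixes the box
            # Simplification: the stored label is always the ground truth,
            # even when the proposal was accepted without correction.
            labeled.append(truth)
        model = train(labeled)     # retrain on everything labeled so far
    return labeled, corrections
```

Comparing `corrections` with the number of images gives a rough measure of the manual effort saved; swapping in a different ordering of `truth_boxes` corresponds to trying a different iteration strategy.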

53.Hardware Acceleration of Sparse and Irregular Tensor Computations of ML Models: A Survey and Insights ⬇️

Machine learning (ML) models are widely used in many domains including media processing and generation, computer vision, medical diagnosis, embedded systems, high-performance and scientific computing, and recommendation systems. For efficiently processing these computational- and memory-intensive applications, tensors of these over-parameterized models are compressed by leveraging sparsity, size reduction, and quantization of tensors. Unstructured sparsity and tensors with varying dimensions yield irregular-shaped computation, communication, and memory access patterns; processing them on hardware accelerators in a conventional manner does not inherently leverage acceleration opportunities. This paper provides a comprehensive survey on how to efficiently execute sparse and irregular tensor computations of ML models on hardware accelerators. In particular, it discusses additional enhancement modules in architecture design and software support; categorizes different hardware designs and acceleration techniques and analyzes them in terms of hardware and execution costs; highlights further opportunities in terms of hardware/software/algorithm co-design optimizations and joint optimizations among described hardware and software enhancement modules. The takeaways from this paper include: understanding the key challenges in accelerating sparse, irregular-shaped, and quantized tensors; understanding enhancements in acceleration systems for supporting their efficient computations; analyzing trade-offs in opting for a specific type of design enhancement; understanding how to map and compile models with sparse tensors on the accelerators; understanding recent design trends for efficient accelerations and further opportunities.

54.An encoder-decoder-based method for COVID-19 lung infection segmentation ⬇️

The novelty of the COVID-19 disease and the speed of its spread have created colossal chaos and have spurred researchers worldwide to exploit all available resources and capabilities to understand and analyze the characteristics of the coronavirus in terms of the ways it spreads and its incubation time. To that end, existing medical data such as CT and X-ray images are used. For example, CT-scan images can be used for the detection of lung infection. However, challenges such as image quality and infection characteristics limit the effectiveness of these data. Artificial intelligence (AI) tools and computer vision algorithms can make detection more accurate and help to overcome these issues. This paper proposes a multi-task deep-learning-based method for lung infection segmentation using CT-scan images. Our proposed method first segments the lung regions that can be infected and then segments the infections in these regions. In addition, to perform multi-class segmentation, the proposed model is trained using two-stream inputs. The multi-task learning used in this paper allows us to overcome the shortage of labeled data, and the multi-input stream allows the model to learn from many features, which can improve the results. To evaluate the proposed method, many features have been used. The experiments show that the proposed method can segment lung infections with high performance even with a shortage of data and labeled images. In addition, compared with the state-of-the-art method, our method achieves good performance.

55.Surface Denoising based on Normal Filtering in a Robust Statistics Framework ⬇️

During a surface acquisition process using 3D scanners, noise is inevitable, and an important step in geometry processing is to remove these noise components from the surfaces (given as point sets or triangulated meshes). The noise-removal process (denoising) can be performed by first filtering the surface normals and then adjusting the vertex positions according to the filtered normals. Therefore, in many available denoising algorithms, the computation of noise-free normals is a key factor. A variety of filters have been introduced for noise removal from normals, with different focus points such as robustness against outliers or against large-amplitude noise. Although these filters perform well in different respects, a unified framework is missing to establish the relation between them and to provide a theoretical analysis beyond the performance of each method.
In this paper, we introduce such a framework to establish relations between a number of widely-used nonlinear filters for face normals in mesh denoising and vertex normals in point set denoising. We cover robust statistical estimation with M-smoothers and their application to linear and non-linear normal filtering. Although these methods originate in different mathematical theories - which include diffusion-, bilateral-, and directional curvature-based algorithms - we demonstrate that all of them can be cast into a unified framework of robust statistics using robust error norms and their corresponding influence functions. This unification contributes to a better understanding of the individual methods and their relations with each other. Furthermore, the presented framework provides a platform for new techniques to combine the advantages of known filters and to compare them with available methods.
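
To make the link between robust error norms, influence functions, and familiar filters concrete, here is a short worked example in standard robust-statistics notation. The symbols (normals n_i, neighbourhood N(i), scale sigma) are assumptions chosen for illustration and are not taken from the paper.

```latex
% M-smoother: the filtered normal minimizes a sum of robust error norms
\hat{n}_i = \arg\min_{n} \sum_{j \in N(i)} \rho_\sigma\!\left(\lVert n - n_j \rVert\right)

% Influence function and the weight it induces
\psi_\sigma(x) = \rho_\sigma'(x), \qquad g_\sigma(x) = \frac{\psi_\sigma(x)}{x}

% Fixed-point (iteratively reweighted) update, followed by renormalization
n_i \leftarrow \frac{\sum_{j \in N(i)} g_\sigma\!\left(\lVert n_i - n_j \rVert\right) n_j}
                    {\left\lVert \sum_{j \in N(i)} g_\sigma\!\left(\lVert n_i - n_j \rVert\right) n_j \right\rVert}

% Example: the Gaussian (Welsch) error norm
\rho_\sigma(x) = \sigma^2\left(1 - e^{-x^2 / (2\sigma^2)}\right), \quad
\psi_\sigma(x) = x\, e^{-x^2 / (2\sigma^2)}, \quad
g_\sigma(x) = e^{-x^2 / (2\sigma^2)}
```

With the Gaussian error norm, the induced weight is exactly the Gaussian range kernel of a bilateral filter, which is one instance of the kind of correspondence the abstract describes. Choosing a different error norm changes only the weight function in the update.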

56.Uncertainty-Guided Efficient Interactive Refinement of Fetal Brain Segmentation from Stacks of MRI Slices ⬇️

Segmentation of the fetal brain from stacks of motion-corrupted fetal MRI slices is important for motion correction and high-resolution volume reconstruction. Although Convolutional Neural Networks (CNNs) have been widely used for automatic segmentation of the fetal brain, their results may still benefit from interactive refinement for challenging slices. To improve the efficiency of the interactive refinement process, we propose an Uncertainty-Guided Interactive Refinement (UGIR) framework. We first propose a grouped convolution-based CNN to obtain multiple automatic segmentation predictions with uncertainty estimation in a single forward pass, then guide the user to provide interactions only in the subset of slices with the highest uncertainty. A novel interactive level set method is also proposed to obtain a refined result given the initial segmentation and user interactions. Experimental results show that: (1) our proposed CNN obtains uncertainty estimation in real time which correlates well with mis-segmentations, (2) the proposed interactive level set is effective and efficient for refinement, (3) UGIR obtains accurate refinement results with around a 30% improvement in efficiency by using uncertainty to guide user interactions. Our code is available online.
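
The slice-selection step described in this abstract can be sketched as follows. The abstract does not say how uncertainty is computed from the multiple predictions, so this sketch assumes per-pixel variance across predictions, averaged over each slice; the function names and the array layout are hypothetical.

```python
import numpy as np

def slice_uncertainty(probs):
    """Per-slice uncertainty from several predictions.

    probs has shape (n_predictions, n_slices, H, W) and holds foreground
    probabilities. Assumed measure: variance across predictions per pixel,
    averaged over all pixels of a slice.
    """
    return probs.var(axis=0).mean(axis=(1, 2))

def slices_to_refine(probs, k):
    """Return the indices of the k most uncertain slices and all scores."""
    scores = slice_uncertainty(probs)
    order = np.argsort(-scores, kind="stable")  # most uncertain first
    return order[:k].tolist(), scores
```

The user would then be asked to interact only with the returned slices, leaving the low-uncertainty slices untouched.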

57.NP-PROV: Neural Processes with Position-Relevant-Only Variances ⬇️

The Neural Process (NP) family encodes distributions over functions into a latent representation, given context data, and decodes the posterior mean and variance at unknown locations. Since mean and variance are derived from the same latent space, these models may fail on out-of-domain tasks where fluctuations in function values amplify the model uncertainty. We present a new member named Neural Processes with Position-Relevant-Only Variances (NP-PROV). NP-PROV hypothesizes that a target point close to a context point has small uncertainty, regardless of the function value at that position. The resulting approach derives mean and variance from a function-value-related space and a position-related-only latent space separately. Our evaluation on synthetic and real-world datasets reveals that NP-PROV can achieve state-of-the-art likelihood while retaining a bounded variance when drifts exist in the function value.

58.Rapid tissue oxygenation mapping from snapshot structured-light images with adversarial deep learning ⬇️

Spatial frequency domain imaging (SFDI) is a powerful technique for mapping tissue oxygen saturation over a wide field of view. However, current SFDI methods either require a sequence of several images with different illumination patterns or, in the case of single snapshot optical properties (SSOP), introduce artifacts and sacrifice accuracy. To avoid this tradeoff, we introduce OxyGAN: a data-driven, content-aware method to estimate tissue oxygenation directly from single structured light images using end-to-end generative adversarial networks. Conventional SFDI is used to obtain ground truth tissue oxygenation maps for ex vivo human esophagi, in vivo hands and feet, and an in vivo pig colon sample under 659 nm and 851 nm sinusoidal illumination. We benchmark OxyGAN by comparing to SSOP and to a two-step hybrid technique that uses a previously-developed deep learning model to predict optical properties followed by a physical model to calculate tissue oxygenation. When tested on human feet, a cross-validated OxyGAN maps tissue oxygenation with an accuracy of 96.5%. When applied to sample types not included in the training set, such as human hands and pig colon, OxyGAN achieves a 93.0% accuracy, demonstrating robustness to various tissue types. On average, OxyGAN outperforms SSOP and a hybrid model in estimating tissue oxygenation by 24.9% and 24.7%, respectively. Lastly, we optimize OxyGAN inference so that oxygenation maps are computed ~10 times faster than previous work, enabling video-rate, 25Hz imaging. Due to its rapid acquisition and processing speed, OxyGAN has the potential to enable real-time, high-fidelity tissue oxygenation mapping that may be useful for many clinical applications.

59.Deep learning-based holographic polarization microscopy ⬇️

Polarized light microscopy provides high contrast to birefringent specimens and is widely used as a diagnostic tool in pathology. However, polarization microscopy systems typically operate by analyzing images collected from two or more light paths in different states of polarization, which leads to relatively complex optical designs, high system costs, or a need for experienced technicians. Here, we present a deep learning-based holographic polarization microscope that is capable of obtaining quantitative birefringence retardance and orientation information of a specimen from a phase-recovered hologram, while only requiring the addition of one polarizer/analyzer pair to an existing holographic imaging system. Using a deep neural network, the reconstructed holographic images from a single state of polarization can be transformed into images equivalent to those captured using a single-shot computational polarized light microscope (SCPLM). Our analysis shows that a trained deep neural network can extract the birefringence information using both the sample-specific morphological features and the holographic amplitude and phase distribution. To demonstrate the efficacy of this method, we tested it by imaging various birefringent samples, including monosodium urate (MSU) and triamcinolone acetonide (TCA) crystals. Our method achieves similar results to SCPLM both qualitatively and quantitatively, and due to its simpler optical design and significantly larger field-of-view, this method has the potential to expand access to polarization microscopy and its use for medical diagnosis in resource-limited settings.

60.Adversarial Example Games ⬇️

The existence of adversarial examples capable of fooling trained neural network classifiers calls for a much better understanding of possible attacks, in order to guide the development of safeguards against them. This includes attack methods in the highly challenging non-interactive blackbox setting, where adversarial attacks are generated without any access, including queries, to the target model. Prior works in this setting have relied mainly on algorithmic innovations derived from empirical observations (e.g., that momentum helps), and the field currently lacks a firm theoretical basis for understanding transferability in adversarial attacks. In this work, we address this gap and lay the theoretical foundations for crafting transferable adversarial examples to entire function classes. We introduce Adversarial Example Games (AEG), a novel framework that models adversarial examples as two-player min-max games between an attack generator and a representative classifier. We prove that the saddle point of an AEG corresponds to a generating distribution of adversarial examples against entire function classes. Training the generator only requires the ability to optimize a representative classifier from a given hypothesis class, enabling BlackBox transfer to unseen classifiers from the same class. We demonstrate the efficacy of our approach on the MNIST and CIFAR-10 datasets against both undefended and robustified models, achieving competitive performance with state-of-the-art BlackBox transfer approaches.
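
One plausible way to write such a two-player game is shown below. The abstract states only that the game is min-max between an attack generator and a representative classifier; the symbols, the order of the two players, and the norm constraint on the perturbation are assumptions made for illustration.

```latex
% g: attack generator, f: representative classifier from hypothesis class F,
% \ell: classification loss, \epsilon: assumed perturbation budget
\max_{g \in \mathcal{G}} \; \min_{f \in \mathcal{F}} \;
\mathbb{E}_{(x, y) \sim \mathcal{D}}
\Big[ \ell\big( f(g(x, y)),\, y \big) \Big]
\quad \text{s.t.} \quad \lVert g(x, y) - x \rVert \le \epsilon
```

Read this way, the generator seeks perturbed inputs that keep the loss high even for the best-responding classifier in the class, which is why examples produced at a saddle point are expected to transfer to unseen classifiers from the same class.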