ArXiv cs.CV -- Mon, 29 Jun 2020

1.An Advert Creation System for 3D Product Placements ⬇️

Over the past decade, the evolution of video-sharing platforms has attracted significant investment in contextual advertising. Common contextual advertising platforms utilize the information provided by users to integrate 2D visual ads into videos. The existing platforms face many technical challenges, such as ad integration with respect to occluding objects and 3D ad placement. This paper presents a Video Advertisement Placement & Integration (Adverts) framework, which is capable of perceiving the 3D geometry of the scene and the camera motion to blend 3D virtual objects into videos and create the illusion of reality. The proposed framework contains several modules, including monocular depth estimation, object segmentation, background-foreground separation, alpha matting, and camera tracking. Our experiments with the Adverts framework indicate the significant potential of this system for contextual ad integration and for pushing the limits of the advertising industry with mixed reality technologies.

2.Person Re-identification by analyzing Dynamic Variations in Gait Sequences ⬇️

Gait recognition is a biometric technology that identifies individuals in a video sequence by analysing their style of walking or limb movement. However, this identification is generally sensitive to appearance changes, and conventional feature descriptors such as the Gait Energy Image (GEI) lose some of the dynamic information in the gait sequence. The Active Energy Image (AEI) focuses more on dynamic motion changes than the GEI and is better suited to dealing with appearance changes. We propose a new approach that recognizes people by analysing dynamic motion variations, without using a database of predicted changes. In the proposed method, the active energy image is calculated by averaging the difference frames of the silhouette sequence and is divided into multiple segments. Affine moment invariants are computed as gait features for each segment. Next, matching weights are calculated based on the similarity between the extracted features and those in the database. Finally, the subject is identified by the weighted combination of similarities across all segments. The CASIA-B Gait Database is used as the principal dataset for the experimental analysis.
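
As a rough illustration of the pipeline described above, the sketch below computes an AEI by averaging the absolute difference frames of a silhouette sequence and extracts per-segment moment features. The segment count is arbitrary, and OpenCV's Hu moments stand in for the paper's affine moment invariants; both are illustrative assumptions, not the authors' exact choices.

```python
# Hedged sketch: AEI from a silhouette sequence, plus per-segment moment features.
import numpy as np
import cv2

def active_energy_image(silhouettes: np.ndarray) -> np.ndarray:
    """silhouettes: (T, H, W) binary frames; returns the (H, W) AEI."""
    diffs = np.abs(np.diff(silhouettes.astype(np.float32), axis=0))
    return diffs.mean(axis=0)  # average of the difference frames

def segment_features(aei: np.ndarray, n_segments: int = 4) -> np.ndarray:
    """Split the AEI into horizontal bands and compute moment invariants per band.
    Hu moments are a stand-in for the paper's affine moment invariants."""
    feats = []
    for band in np.array_split(aei, n_segments, axis=0):
        feats.append(cv2.HuMoments(cv2.moments(band)).ravel())
    return np.concatenate(feats)
```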

3.ULSAM: Ultra-Lightweight Subspace Attention Module for Compact Convolutional Neural Networks ⬇️

The capability of the self-attention mechanism to model long-range dependencies has catapulted its deployment in vision models. Unlike convolution operators, self-attention offers an infinite receptive field and enables compute-efficient modeling of global dependencies. However, existing state-of-the-art attention mechanisms incur high compute and/or parameter overheads, making them unfit for compact convolutional neural networks (CNNs). In this work, we propose a simple yet effective "Ultra-Lightweight Subspace Attention Mechanism" (ULSAM), which infers a different attention map for each feature-map subspace. We argue that learning separate attention maps for each feature subspace enables multi-scale and multi-frequency feature representation, which is more desirable for fine-grained image classification. Our method of subspace attention is orthogonal and complementary to the existing state-of-the-art attention mechanisms used in vision models. ULSAM is end-to-end trainable and can be deployed as a plug-and-play module in pre-existing compact CNNs. Notably, our work is the first to use a subspace attention mechanism to increase the efficiency of compact CNNs. To show the efficacy of ULSAM, we perform experiments with MobileNet-V1 and MobileNet-V2 as backbone architectures on ImageNet-1K and three fine-grained image classification datasets. We achieve $\approx$13% and $\approx$25% reductions in the FLOPs and parameter counts of MobileNet-V2, with 0.27% and more than 1% improvements in top-1 accuracy on ImageNet-1K and the fine-grained image classification datasets, respectively. Code and trained models are available at this https URL.
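
A minimal PyTorch sketch of the subspace-attention idea: split the channels into subspaces and infer a separate spatial attention map for each. The 1x1-conv-plus-spatial-softmax attention and the residual re-weighting used here are illustrative assumptions; the published ULSAM operators differ in their details.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SubspaceAttention(nn.Module):
    """Per-subspace spatial attention in the spirit of ULSAM (sketch)."""
    def __init__(self, channels: int, groups: int):
        super().__init__()
        assert channels % groups == 0
        self.groups = groups
        sub = channels // groups
        # one attention-map predictor per channel subspace
        self.attn = nn.ModuleList([nn.Conv2d(sub, 1, kernel_size=1) for _ in range(groups)])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        outs = []
        for i, chunk in enumerate(torch.chunk(x, self.groups, dim=1)):
            a = self.attn[i](chunk)                         # (N, 1, H, W)
            a = F.softmax(a.flatten(2), dim=-1).view_as(a)  # spatial softmax
            outs.append(chunk * (1.0 + a))                  # residual re-weighting
        return torch.cat(outs, dim=1)

# usage: y = SubspaceAttention(64, groups=4)(torch.randn(2, 64, 32, 32))
```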

4.End-to-end training of deep kernel map networks for image classification ⬇️

Deep kernel map networks have shown excellent performance in various classification problems, including image annotation. Their general recipe consists of aggregating several layers of singular value decompositions (SVDs) -- which map data from input spaces into high-dimensional spaces -- while preserving the similarity of the underlying kernels. However, the potential of these deep map networks has not been fully explored, as their original setting focuses mainly on the approximation quality of their kernels and ignores their discrimination power. In this paper, we introduce a novel "end-to-end" design for deep kernel map learning that balances the approximation quality of kernels with their discrimination power. Our method proceeds in two steps: first, layerwise SVD is applied to build initial deep kernel map approximations; then, "end-to-end" supervised learning is employed to further enhance their discrimination power while maintaining their efficiency. Extensive experiments, conducted on the challenging ImageCLEF annotation benchmark, show the high efficiency of this two-step process and its outperformance with respect to different related methods.

5.Cross-Supervised Object Detection ⬇️

After learning a new object category from image-level annotations (with no object bounding boxes), humans are remarkably good at precisely localizing those objects. However, building good object localizers (i.e., detectors) currently requires expensive instance-level annotations. While some work has been done on learning detectors from weakly labeled samples (with only class labels), these detectors do poorly at localization. In this work, we show how to build better object detectors from weakly labeled images of new categories by leveraging knowledge learned from fully labeled base categories. We call this novel learning paradigm cross-supervised object detection. We propose a unified framework that combines a detection head trained from instance-level annotations and a recognition head learned from image-level annotations, together with a spatial correlation module that bridges the gap between detection and recognition. These contributions enable us to better detect novel objects with image-level annotations in complex multi-object scenes such as the COCO dataset.

6.High Resolution Zero-Shot Domain Adaptation of Synthetically Rendered Face Images ⬇️

Generating photorealistic images of human faces at scale remains a prohibitively difficult task for computer graphics approaches. This is because photorealism requires the simulation of light, which in turn requires physically accurate modelling of geometry, materials, and light sources, for both the head and the surrounding scene. Non-photorealistic renders, however, are increasingly easy to produce. In contrast to computer graphics approaches, generative models learned from more readily available 2D image data have been shown to produce samples of human faces that are hard to distinguish from real data. However, the process of learning usually entails a loss of control over the shape and appearance of the generated images. For instance, even a simple disentangling task such as modifying the hair independently of the face, which is trivial to accomplish with a computer graphics approach, remains an open research question. In this work, we propose an algorithm that matches a non-photorealistic, synthetically generated image to a latent vector of a pretrained StyleGAN2 model which, in turn, maps the vector to a photorealistic image of a person with the same pose, expression, hair, and lighting. In contrast to most previous work, we require no synthetic training data. To the best of our knowledge, this is the first algorithm of its kind to work at a resolution of 1K, and it represents a significant leap forward in visual realism.
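
A hedged sketch of the matching step described above: optimize a latent vector so that a pretrained generator reproduces the target render. `G` (a latent-to-image generator) and `lpips_loss` (a perceptual distance) are assumed stand-ins, not the paper's actual components or losses.

```python
import torch

def invert(G, target: torch.Tensor, lpips_loss, latent_dim: int = 512,
           steps: int = 500, lr: float = 0.01) -> torch.Tensor:
    """Optimize a latent vector w so that G(w) matches the target image."""
    w = torch.zeros(1, latent_dim, requires_grad=True)
    opt = torch.optim.Adam([w], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        img = G(w)  # render the current guess
        loss = lpips_loss(img, target) + (img - target).abs().mean()
        loss.backward()
        opt.step()
    return w.detach()
```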

7.SASO: Joint 3D Semantic-Instance Segmentation via Multi-scale Semantic Association and Salient Point Clustering Optimization ⬇️

We propose a novel 3D point cloud segmentation framework named SASO, which jointly performs semantic and instance segmentation. For the semantic segmentation task, inspired by the inherent correlation among objects in a spatial context, we propose a Multi-scale Semantic Association (MSA) module to exploit the constructive effects of semantic context information. For the instance segmentation task, unlike previous works that utilize clustering only in the inference procedure, we propose a Salient Point Clustering Optimization (SPCO) module that introduces a clustering procedure into the training process and impels the network to focus on points that are difficult to distinguish. In addition, owing to the inherent structure of indoor scenes, the imbalance of the category distribution is rarely considered, yet it severely limits the performance of 3D scene perception. To address this issue, we introduce an adaptive Water Filling Sampling (WFS) algorithm to balance the category distribution of the training data. Extensive experiments demonstrate that our method outperforms state-of-the-art methods on benchmark datasets in both semantic and instance segmentation tasks.

8.Suggestive Annotation of Brain Tumour Images with Gradient-guided Sampling ⬇️

Machine learning has been widely adopted for medical image analysis in recent years, given its promising performance in image segmentation and classification tasks. As a data-driven science, the success of machine learning, in particular supervised learning, largely depends on the availability of manually annotated datasets. For medical imaging applications, such annotated datasets are not easy to acquire: it takes a substantial amount of time and resources to curate an annotated medical image set. In this paper, we propose an efficient annotation framework for brain tumour images that is able to suggest informative sample images for human experts to annotate. Our experiments show that training a segmentation model with only 19% suggestively annotated patient scans from the BraTS 2019 dataset can achieve performance comparable to training on the full dataset for the whole tumour segmentation task. This demonstrates a promising way to save manual annotation cost and improve data efficiency in medical imaging applications.

9.Fast Multi-Level Foreground Estimation ⬇️

Alpha matting aims to estimate the translucency of an object in a given image. The resulting alpha matte describes, pixel-wise, to what extent the foreground and background colors contribute to the color of the composite image. While most methods in the literature focus on estimating the alpha matte, the process of estimating the foreground colors given the input image and its alpha matte is often neglected, although foreground estimation is an essential part of many image editing workflows. In this work, we propose a novel method for foreground estimation given the alpha matte. We demonstrate that our fast multi-level approach yields results comparable with the state of the art while outperforming those methods in computational runtime and memory usage.
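
The underlying compositing model is $I = \alpha F + (1 - \alpha) B$, and foreground estimation solves for $F$ (and $B$) given $I$ and $\alpha$. Below is a coarse-to-fine sketch under that model; the level schedule, iteration count, and step size are assumptions, not the paper's exact algorithm.

```python
import numpy as np
import cv2

def estimate_foreground(image: np.ndarray, alpha: np.ndarray,
                        levels: int = 5, iters: int = 20, lr: float = 0.5):
    """image: (H, W, 3) floats in [0, 1]; alpha: (H, W). Returns (F, B)."""
    h, w = alpha.shape
    F = B = None
    for lvl in reversed(range(levels)):          # coarse to fine
        lw, lh = max(2, w >> lvl), max(2, h >> lvl)
        I = cv2.resize(image, (lw, lh))
        a = cv2.resize(alpha, (lw, lh))[..., None]
        if F is None:
            F, B = I.copy(), I.copy()
        else:                                    # upsample the previous level's estimate
            F, B = cv2.resize(F, (lw, lh)), cv2.resize(B, (lw, lh))
        for _ in range(iters):                   # gradient steps on ||aF+(1-a)B-I||^2
            r = a * F + (1 - a) * B - I          # compositing residual
            F -= lr * a * r
            B -= lr * (1 - a) * r
    return np.clip(F, 0, 1), np.clip(B, 0, 1)
```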

10.Ensemble Transfer Learning for Emergency Landing Field Identification on Moderate Resource Heterogeneous Kubernetes Cluster ⬇️

A full loss of thrust in an aircraft requires fast and reliable decisions from the pilot. If no published landing field is within reach, an emergency landing field must be selected. The choice of a suitable emergency landing field is a crucial task to avoid unnecessary damage to the aircraft and risk to the civil population, the crew, and all passengers on board. Especially in instrument meteorological conditions, it is indispensable to use a database of suitable emergency landing fields. Thus, based on publicly available digital orthographic photos and digital surface models, we created various datasets with different sample sizes to facilitate the training and testing of neural networks. Each dataset consists of a set of data layers. The best compositions of these data layers, as well as the best-performing transfer learning models, are selected. Subsequently, certain hyperparameters of the chosen models for each sample size are optimized with Bayesian and Bandit optimization. The hyperparameter tuning is performed on a self-built Kubernetes cluster. The models' outputs were investigated with respect to the input data using layer-wise relevance propagation. With the optimized models, we created an ensemble model to improve the segmentation performance. Finally, an area around the airport of Arnsberg in North Rhine-Westphalia was segmented and emergency landing fields were identified, while the verification of the final approach's obstacle clearance was left unconsidered. These emergency landing fields are stored in a PostgreSQL database.

11.RPM-Net: Recurrent Prediction of Motion and Parts from Point Cloud ⬇️

We introduce RPM-Net, a deep learning-based approach that simultaneously infers movable parts and hallucinates their motions from a single, unsegmented, and possibly partial 3D point cloud shape. RPM-Net is a novel Recurrent Neural Network (RNN), composed of an encoder-decoder pair with interleaved Long Short-Term Memory (LSTM) components, which together predict a temporal sequence of pointwise displacements for the input point cloud. At the same time, the displacements allow the network to learn movable parts, resulting in a motion-based shape segmentation. Recursive applications of RPM-Net on the obtained parts can predict finer-level part motions, resulting in a hierarchical object segmentation. Furthermore, we develop a separate network to estimate part mobilities, i.e., per-part motion parameters, from the segmented motion sequence. Both networks learn deep predictive models from a training set that exemplifies a variety of mobilities for diverse objects. We show results of simultaneous motion and part predictions from synthetic and real scans of 3D objects exhibiting a variety of part mobilities, possibly involving multiple movable parts.

12.Domain Contrast for Domain Adaptive Object Detection ⬇️

We present Domain Contrast (DC), a simple yet effective approach inspired by contrastive learning for training domain-adaptive detectors. DC is deduced from an error-bound minimization perspective on a transferred model, and is implemented with a plug-and-play cross-domain contrast loss. By minimizing the cross-domain contrast loss, DC guarantees the transferability of detectors while naturally alleviating the class imbalance issue in the target domain. DC can be applied at either the image level or the region level, consistently improving detectors' transferability and discriminability. Extensive experiments on commonly used benchmarks show that DC improves the baseline and the state-of-the-art by significant margins, while demonstrating great potential for large domain divergence.

13.Designing and Learning Trainable Priors with Non-Cooperative Games ⬇️

We introduce a general framework for designing and learning neural networks whose forward passes can be interpreted as solving convex optimization problems, and whose architectures are derived from an optimization algorithm. We focus on non-cooperative convex games, solved by local agents represented by the nodes of a graph and interacting through regularization functions. This approach is appealing for solving imaging problems, as it allows the use of classical image priors within deep models that are trainable end to end. The priors considered in this work include variants of total variation, Laplacian regularization, sparse coding on learned dictionaries, and non-local self-similarities. Our models are parameter-efficient and fully interpretable, and our experiments demonstrate their effectiveness on a large diversity of tasks, ranging from image denoising and compressed sensing for fMRI to dense stereo matching.

14.AutoSNAP: Automatically Learning Neural Architectures for Instrument Pose Estimation ⬇️

Despite recent successes, the advances in deep learning have not yet been fully translated to Computer Assisted Intervention (CAI) problems such as pose estimation of surgical instruments. Currently, neural architectures for classification and segmentation tasks are adopted while ignoring significant discrepancies between CAI and these tasks. We propose an automatic framework (AutoSNAP) for instrument pose estimation problems, which discovers and learns architectures for neural networks. We introduce 1) an efficient testing environment for pose estimation, 2) a powerful architecture representation based on novel Symbolic Neural Architecture Patterns (SNAPs), and 3) an optimization of the architecture using an efficient search scheme. Using AutoSNAP, we discover an improved architecture (SNAPNet) which outperforms both the hand-engineered i3PosNet and the state-of-the-art architecture search method DARTS.

15.An Automatic Reader of Identity Documents ⬇️

The automatic reading and verification of identity documents is an appealing technology for today's service industry, since the task is still mostly performed manually, wasting economic and time resources. In this paper, the prototype of a novel automatic reading system for identity documents is presented. The system is designed to extract data from the main Italian identity documents from photographs of acceptable quality, like those usually required from online subscribers of various services. The document is first localized inside the photo, then classified; finally, text recognition is executed. A synthetic dataset has been used, both for training the neural networks and for evaluating the system's performance. The synthetic dataset avoids the privacy issues linked to the use of real photos of real documents, which will be used instead for future developments of the system.

16.Expandable YOLO: 3D Object Detection from RGB-D Images ⬇️

This paper aims at constructing a lightweight object detector that takes as input a depth image and a color image from a stereo camera. Specifically, by extending the network architecture of YOLOv3 to 3D in the middle, it is possible to produce output in the depth direction. In addition, Intersection over Union (IoU) in 3D space is introduced to assess the accuracy of the region extraction results. In the field of deep learning, object detectors that use distance information as input are actively studied for automated driving. However, conventional detectors have large network structures, and their real-time performance is impaired. The effectiveness of the detector constructed as described above is verified using datasets. As a result of this experiment, the proposed model is able to output 3D bounding boxes and detect people whose bodies are partly hidden. Further, the processing speed of the model is 44.35 fps.
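
For reference, IoU for axis-aligned boxes in 3D space, which the paper introduces to assess region extraction, can be computed as below; rotated boxes would need a more involved overlap test.

```python
def iou_3d(a, b) -> float:
    """a, b: axis-aligned boxes as (x1, y1, z1, x2, y2, z2)."""
    inter = 1.0
    for i in range(3):
        lo, hi = max(a[i], b[i]), min(a[i + 3], b[i + 3])
        if hi <= lo:
            return 0.0        # no overlap along this axis
        inter *= hi - lo
    def vol(box):
        return (box[3] - box[0]) * (box[4] - box[1]) * (box[5] - box[2])
    return inter / (vol(a) + vol(b) - inter)

# iou_3d((0, 0, 0, 2, 2, 2), (1, 1, 1, 3, 3, 3))  # -> 1/15
```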

17.Few-Shot Anomaly Detection for Polyp Frames from Colonoscopy ⬇️

Anomaly detection methods generally target the learning of a normal image distribution (i.e., inliers showing healthy cases); during testing, samples relatively far from the learned distribution are classified as anomalies (i.e., outliers showing disease cases). These approaches tend to be sensitive to outliers that lie relatively close to inliers (e.g., a colonoscopy image with a small polyp). In this paper, we address this inappropriate sensitivity to outliers by also learning from inliers. We propose a new few-shot anomaly detection method based on an encoder trained to maximise the mutual information between feature embeddings and normal images, followed by a few-shot score inference network trained with a large set of inliers and a substantially smaller set of outliers. We evaluate our proposed method on the clinical problem of detecting frames containing polyps from colonoscopy video sequences, where the training set has 13350 normal images (i.e., without polyps) and fewer than 100 abnormal images (i.e., with polyps). The results of our proposed model on this dataset reveal state-of-the-art detection performance, which remains relatively stable once approximately 40 abnormal training images are available.

18.Text Detection on Roughly Placed Books by Leveraging a Learning-based Model Trained with Another Domain Data ⬇️

Text detection enables us to extract rich information from images. In this paper, we focus on how to generate bounding boxes that appropriately cover text areas on books, to help implement automatic text detection. Rather than improving a learning-based model by training it with a sufficient amount of data from the target domain, we leverage a model that has already been trained with data from another domain. We develop algorithms that construct the bounding boxes by improving and leveraging the results of a learning-based method. Our algorithms can utilize different learning-based approaches to detect scene texts. Experimental evaluations demonstrate that our algorithms work well in various situations where books are roughly placed.

19.Ricci Curvature Based Volumetric Segmentation of the Auditory Ossicles ⬇️

The auditory ossicles, located in the middle ear, are the smallest bones in the human body. Damage to them results in hearing loss. It is therefore important to be able to automatically diagnose ossicle diseases based on Computed Tomography (CT) 3D imaging. However, CT images usually include the whole head area, which is much larger than the bones of interest, so the localization of the ossicles, followed by segmentation, both play a significant role in automatic diagnosis. The commonly employed local segmentation methods require manually selected initial points, which is a highly time-consuming process. We therefore propose a completely automatic method to locate the ossicles which requires neither templates nor manual labels. It relies solely on the connective properties of the auditory ossicles themselves and their relationship with the surrounding tissue fluid. For the segmentation task, we define a novel energy function and obtain the shape of the ossicles from the 3D CT image by minimizing this energy. In contrast to state-of-the-art methods, which usually use the gradient operator and some normalization terms, we propose to add a Ricci curvature term to the commonly employed energy function. We compare our proposed method with the state-of-the-art methods and show that the performance of discrete Forman-Ricci curvature is superior to the others.
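
For orientation, the discrete Forman-Ricci curvature the paper builds on reduces, for an unweighted graph without higher-order faces, to a simple per-edge degree formula. The sketch below evaluates it on a toy grid graph standing in for voxel adjacency; how the curvature term enters the paper's energy functional is not reproduced here.

```python
import networkx as nx

def forman_curvature(G: nx.Graph) -> dict:
    """Combinatorial Forman-Ricci curvature of each edge: F(u, v) = 4 - deg(u) - deg(v)."""
    return {(u, v): 4 - G.degree(u) - G.degree(v) for u, v in G.edges()}

# toy stand-in for a voxel adjacency graph
curv = forman_curvature(nx.grid_2d_graph(4, 4))
```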

20.Unsupervised Discovery of Object Landmarks via Contrastive Learning ⬇️

Given a collection of images, humans are able to discover landmarks of the depicted objects by modeling the shared geometric structure across instances. This idea of geometric equivariance has been widely used for the unsupervised discovery of object landmark representations. In this paper, we develop a simple and effective approach based on contrastive learning of invariant representations. We show that when a deep network is trained to be invariant to geometric and photometric transformations, representations from its intermediate layers are highly predictive of object landmarks. Furthermore, by stacking representations across layers into a hypercolumn, their effectiveness can be improved. Our approach is motivated by the phenomenon of the gradual emergence of invariance in the representation hierarchy of a deep network. We also present a unified view of existing equivariant and invariant representation learning approaches through the lens of contrastive learning, shedding light on the nature of the invariances learned. Experiments on standard benchmarks for landmark discovery, as well as a challenging one we propose, show that the proposed approach surpasses the prior state-of-the-art.

21.Blind Image Deconvolution using Student's-t Prior with Overlapping Group Sparsity ⬇️

In this paper, we address the blind image deconvolution problem, which is to remove blur from a degraded image without any knowledge of the blur kernel. Since the problem is ill-posed, an image prior plays a significant role in accurate blind deconvolution. Traditional image priors assume that coefficients in filtered domains are sparse. However, we assume here that there exist additional structures over the sparse coefficients. Accordingly, we propose a new problem formulation for blind image deconvolution which utilizes this structural information by coupling a Student's-t image prior with overlapping group sparsity. The proposed formulation results in an effective blind deconvolution algorithm that outperforms other state-of-the-art algorithms.

22.Pushing the Limit of Unsupervised Learning for Ultrasound Image Artifact Removal ⬇️

Ultrasound (US) imaging is a fast and non-invasive imaging modality which is widely used for real-time clinical imaging applications without concerns about radiation hazards. Unfortunately, it often suffers from poor visual quality due to various origins, such as speckle noise, blurring, multi-line acquisition (MLA), limited RF channels, a small number of view angles in the case of plane-wave imaging, etc. Classical methods to deal with these problems include image-domain signal processing approaches using various adaptive filtering and model-based approaches. Recently, deep learning approaches have been successfully used in the ultrasound imaging field. However, one of the limitations of these approaches is that paired high-quality images for supervised training are difficult to obtain in many practical applications. In this paper, inspired by the recent theory of unsupervised learning using optimal transport driven cycleGAN (OT-cycleGAN), we investigate the applicability of unsupervised deep learning to US artifact removal problems without matched reference data. Experimental results for various tasks such as deconvolution, speckle removal, and limited-data artifact removal confirm that our unsupervised learning method provides results comparable to supervised learning for many practical applications.

23.Lesion Mask-based Simultaneous Synthesis of Anatomic and Molecular MR Images using a GAN ⬇️

Data-driven automatic approaches have demonstrated their great potential for resolving various clinical diagnostic dilemmas in neuro-oncology for patients with malignant gliomas, with the help of conventional and advanced molecular MR images. However, the lack of sufficient annotated MRI data has vastly impeded the development of such automatic methods. Conventional data augmentation approaches, including flipping, scaling, rotation, and distortion, are not capable of generating data with diverse image content. In this paper, we propose a generative adversarial network (GAN) which can simultaneously synthesize data from arbitrarily manipulated lesion information on multiple anatomic and molecular MRI sequences, including T1-weighted (T1w), gadolinium-enhanced T1w (Gd-T1w), T2-weighted (T2w), fluid-attenuated inversion recovery (FLAIR), and amide proton transfer-weighted (APTw). The proposed framework consists of a stretch-out up-sampling module, a brain atlas encoder, a segmentation consistency module, and multi-scale label-wise discriminators. Extensive experiments on real clinical data demonstrate that the proposed model performs significantly better than state-of-the-art synthesis methods.

24.Meta Deformation Network: Meta Functionals for Shape Correspondence ⬇️

We present a new technique named "Meta Deformation Network" for 3D shape matching via deformation, in which a deep neural network maps a reference shape onto the parameters of a second neural network, whose task is to give the correspondence between a learned template and a query shape via deformation. We categorize the second neural network as a meta-function, or a function generated by another function, as its parameters are dynamically given by the first network on a per-input basis. This leads to a straightforward overall architecture and faster execution speeds, without loss in the quality of the deformation of the template. We show in our experiments that the Meta Deformation Network leads to improvements on the MPI-FAUST Inter Challenge over designs that utilize a conventional decoder with non-dynamic parameters.
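
A toy example of a meta-function in the sense used above: a first network emits the parameters of a second, which is then applied to the query on a per-input basis. The sizes and the single-linear-layer form are illustrative assumptions; the paper's decoder is substantially larger.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MetaLinear(nn.Module):
    def __init__(self, ref_dim: int, in_dim: int, out_dim: int):
        super().__init__()
        self.in_dim, self.out_dim = in_dim, out_dim
        # network A: produces the weights and bias of a linear layer (network B)
        self.param_gen = nn.Linear(ref_dim, in_dim * out_dim + out_dim)

    def forward(self, ref: torch.Tensor, query: torch.Tensor) -> torch.Tensor:
        p = self.param_gen(ref)                               # flat parameter vector
        W = p[: self.in_dim * self.out_dim].view(self.out_dim, self.in_dim)
        b = p[self.in_dim * self.out_dim:]
        return F.linear(query, W, b)                          # network B, dynamic params

# usage: MetaLinear(128, 3, 3)(torch.randn(128), torch.randn(10, 3))
```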

25.Deepfake Detection using Spatiotemporal Convolutional Networks ⬇️

Better generative models and larger datasets have led to more realistic fake videos that can fool the human eye but produce temporal and spatial artifacts that deep learning approaches can detect. Most current Deepfake detection methods only use individual video frames and therefore fail to learn from temporal information. We created a benchmark of the performance of spatiotemporal convolutional methods using the Celeb-DF dataset. Our methods outperformed state-of-the-art frame-based detection methods. Code for our paper is publicly available at this https URL.

26.Unsupervised Video Decomposition using Spatio-temporal Iterative Inference ⬇️

Unsupervised multi-object scene decomposition is a fast-emerging problem in representation learning. Despite significant progress in static scenes, such models are unable to leverage important dynamic cues present in video. We propose a novel spatio-temporal iterative inference framework that is powerful enough to jointly model complex multi-object representations and explicit temporal dependencies between latent variables across frames. This is achieved by leveraging 2D-LSTM, temporally conditioned inference and generation within the iterative amortized inference for posterior refinement. Our method improves the overall quality of decompositions, encodes information about the objects' dynamics, and can be used to predict trajectories of each object separately. Additionally, we show that our model has a high accuracy even without color information. We demonstrate the decomposition, segmentation, and prediction capabilities of our model and show that it outperforms the state-of-the-art on several benchmark datasets, one of which was curated for this work and will be made publicly available.

27.CognitiveCNN: Mimicking Human Cognitive Models to resolve Texture-Shape Bias ⬇️

Recent works demonstrate a texture bias in Convolutional Neural Networks (CNNs), conflicting with earlier works claiming that networks identify objects using shape. It is commonly believed that the cost function forces the network to take a greedy route to increase accuracy using texture, failing to explore any global statistics. We propose a novel, intuitive architecture, namely CognitiveCNN, inspired by feature integration theory in psychology, which utilises human-interpretable features like shape, texture, and edges to reconstruct and classify the image. We define two metrics, namely TIC and RIC, to quantify the importance of each stream using attention maps. We introduce a regulariser which ensures that the contribution of each feature is the same for any task as it is for reconstruction, and perform experiments to show the resulting boost in accuracy and robustness, besides imparting explainability. Lastly, we adapt these ideas to conventional CNNs and propose an Augmented Cognitive CNN to achieve superior performance in object recognition.

28.Investigating and Exploiting Image Resolution for Transfer Learning-based Skin Lesion Classification ⬇️

Skin cancer is among the most common cancer types. Dermoscopic image analysis improves the diagnostic accuracy for detection of malignant melanoma and other pigmented skin lesions when compared to unaided visual inspection. Hence, computer-based methods to support medical experts in the diagnostic procedure are of great interest. Fine-tuning pre-trained convolutional neural networks (CNNs) has been shown to work well for skin lesion classification. Pre-trained CNNs are usually trained with natural images of a fixed image size, typically significantly smaller than captured skin lesion images, and consequently dermoscopic images are downsampled for fine-tuning. However, useful medical information may be lost during this transformation. In this paper, we explore the effect of input image size on the skin lesion classification performance of fine-tuned CNNs. For this, we resize dermoscopic images to different resolutions, ranging from 64x64 to 768x768 pixels, and investigate the resulting classification performance of three well-established CNNs, namely DenseNet-121, ResNet-18, and ResNet-50. Our results show that using very small images (of size 64x64 pixels) degrades the classification performance, while images of size 128x128 pixels and above support good performance, with larger image sizes leading to slightly improved classification. We further propose a novel fusion approach based on a three-level ensemble strategy that exploits multiple fine-tuned networks trained with dermoscopic images at various sizes. When applied to the ISIC 2017 skin lesion classification challenge, our fusion approach yields an area under the receiver operating characteristic curve of 89.2% and 96.6% for melanoma classification and seborrheic keratosis classification, respectively, outperforming state-of-the-art algorithms.

29.Perspective Plane Program Induction from a Single Image ⬇️

We study the inverse graphics problem of inferring a holistic representation for natural images. Given an input image, our goal is to induce a neuro-symbolic, program-like representation that jointly models camera poses, object locations, and global scene structures. Such high-level, holistic scene representations further facilitate low-level image manipulation tasks such as inpainting. We formulate this problem as jointly finding the camera pose and scene structure that best describe the input image. The benefits of such joint inference are two-fold: scene regularity serves as a new cue for perspective correction, and in turn, correct perspective correction leads to a simplified scene structure, similar to how the correct shape leads to the most regular texture in shape from texture. Our proposed framework, Perspective Plane Program Induction (P3I), combines search-based and gradient-based algorithms to efficiently solve the problem. P3I outperforms a set of baselines on a collection of Internet images, across tasks including camera pose estimation, global structure inference, and down-stream image manipulation tasks.

30.Learning Data Augmentation with Online Bilevel Optimization for Image Classification ⬇️

Data augmentation is a key practice in machine learning for improving generalization performance. However, finding the best data augmentation hyperparameters requires domain knowledge or a computationally demanding search. We address this issue by proposing an efficient approach to automatically train a network that learns an effective distribution of transformations to improve its generalization score. Using bilevel optimization, we directly optimize the data augmentation parameters using a validation set. This framework can be used as a general solution to learn the optimal data augmentation jointly with an end task model like a classifier. Results show that our joint training method produces an image classification accuracy that is comparable to or better than carefully hand-crafted data augmentation. Yet, it does not need an expensive external validation loop on the data augmentation hyperparameters.

31.Adaptive additive classification-based loss for deep metric learning ⬇️

Recent works have shown that deep metric learning algorithms can benefit from weak supervision from another input modality. This additional modality can be incorporated directly into the popular triplet-based loss function as distances. Also recently, classification loss and proxy-based metric learning have been observed to lead to faster convergence as well as better retrieval results, all the while without requiring complex and costly sampling strategies. In this paper, we propose an extension to the existing adaptive margin for classification-based deep metric learning. Our extension introduces a separate margin for each negative proxy per sample. These margins are computed during training from precomputed distances of the classes in the other modality. Our results set a new state-of-the-art on both the Amazon fashion retrieval dataset and the public DeepFashion dataset. This was observed with both fastText- and BERT-based embeddings for the additional textual modality. Our results were achieved with faster convergence and lower code complexity than the prior state-of-the-art.

32.Duodepth: Static Gesture Recognition Via Dual Depth Sensors ⬇️

Static gesture recognition is an effective non-verbal communication channel between a user and their devices; however, many modern methods are sensitive to the relative pose of the user's hands with respect to the capture device, as parts of the gesture can become occluded. We present two methodologies for gesture recognition via synchronized recording from two depth cameras to alleviate this occlusion problem. One is a more classic approach, using iterative closest point registration to accurately fuse point clouds and a single PointNet architecture for classification; the other is a dual-PointNet architecture for classification without registration. On a manually collected dataset of 20,100 point clouds, we show a 39.2% reduction in misclassification for the fused point cloud method, and 53.4% for the dual PointNet, when compared to a standard single-camera pipeline.

33.Fully Convolutional Open Set Segmentation ⬇️

In semantic segmentation, knowing about all existing classes is essential for yielding effective results with the majority of existing approaches. However, methods trained on a Closed Set of classes fail when new classes are found in the test phase, meaning they are not suitable for Open Set scenarios, which are very common in real-world computer vision and remote sensing applications. In this paper, we discuss the limitations of Closed Set segmentation and propose two fully convolutional approaches to effectively address Open Set semantic segmentation: OpenFCN and OpenPCS. OpenFCN is based on the well-known OpenMax algorithm, configuring a new application of this approach in segmentation settings. OpenPCS is a fully novel approach based on feature-space DNN activations that serve as features for computing PCA and a multivariate Gaussian likelihood in a lower-dimensional space. Experiments were conducted on the well-known Vaihingen and Potsdam segmentation datasets. OpenFCN showed little-to-no improvement when compared to the simpler and much more time-efficient SoftMax thresholding, while being some orders of magnitude slower. OpenPCS achieved promising results in almost all experiments, overcoming both OpenFCN and SoftMax thresholding. OpenPCS is also a reasonable compromise between the runtime performance of the extremely fast SoftMax thresholding and the extremely slow OpenFCN, being able to run close to real time. Experiments also indicate that OpenPCS is effective, robust, and suitable for Open Set segmentation, being able to improve the recognition of unknown-class pixels without reducing the accuracy on known-class pixels.
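
A minimal sketch of the OpenPCS-style scoring the abstract outlines: fit PCA on activations of known-class pixels, model the projected features with a multivariate Gaussian, and flag low-likelihood pixels as unknown. The feature source, component count, and threshold choice are assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from scipy.stats import multivariate_normal

def fit_open_set_scorer(train_feats: np.ndarray, n_components: int = 16):
    """train_feats: (N, D) DNN activations of known-class pixels."""
    pca = PCA(n_components=n_components).fit(train_feats)
    z = pca.transform(train_feats)
    gauss = multivariate_normal(mean=z.mean(0), cov=np.cov(z, rowvar=False),
                                allow_singular=True)
    return pca, gauss

def open_set_scores(pca, gauss, feats: np.ndarray) -> np.ndarray:
    """Higher log-likelihood means more likely a known class."""
    return gauss.logpdf(pca.transform(feats))

# pixels whose score falls below a validation-chosen threshold are flagged "unknown"
```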

34.SPSG: Self-Supervised Photometric Scene Generation from RGB-D Scans ⬇️

We present SPSG, a novel approach to generate high-quality, colored 3D models of scenes from RGB-D scan observations by learning to infer unobserved scene geometry and color in a self-supervised fashion. Our self-supervised approach learns to jointly inpaint geometry and color by correlating an incomplete RGB-D scan with a more complete version of that scan. Notably, rather than relying on 3D reconstruction losses to inform our 3D geometry and color reconstruction, we propose adversarial and perceptual losses operating on 2D renderings in order to achieve high-resolution, high-quality colored reconstructions of scenes. This exploits the high-resolution, self-consistent signal from individual raw RGB-D frames, in contrast to fused 3D reconstructions of the frames, which exhibit inconsistencies from view-dependent effects such as color balancing or pose inconsistencies. Thus, by informing our 3D scene generation directly through the 2D signal, we produce high-quality colored reconstructions of 3D scenes, outperforming the state of the art on both synthetic and real data.

35.Determining Image similarity with Quasi-Euclidean Metric ⬇️

Image similarity is a core concept in Image Analysis due to its extensive application in computer vision, image processing, and pattern recognition. The objective of our study is to evaluate the Quasi-Euclidean metric as an image similarity measure and analyze how it fares against existing standard measures like SSIM and the Euclidean metric. In this paper, we analyzed the similarity between two images from our own novice dataset and assessed its performance against the Euclidean distance metric and SSIM. We also present experimental results, along with evidence indicating that our proposed implementation, when applied to our novice dataset, furnished different results than the standard metrics in terms of effectiveness and accuracy. In some cases, our methodology showed remarkable performance, and it is also interesting to note that our implementation proves to be a step ahead in recognizing similarity when compared to the standard metrics.

36.Computing Light Transport Gradients using the Adjoint Method ⬇️

This paper proposes a new equation from continuous adjoint theory to compute the gradient of quantities governed by the Transport Theory of light. Unlike discrete gradients à la autograd, which work at the code level, we first formulate the continuous theory and then discretize it. The key insight of this paper is that computing gradients in Transport Theory is akin to computing the importance, a quantity adjoint to radiance that satisfies an adjoint equation and tells us where to look for light that matters. In fact, this mathematical journey started from a whimsical thought that these adjoints might be related. Computing gradients is therefore no more complicated than computing the importance field. We hope this insight and the paper will shed some light on this complicated problem and ease the implementation of gradient computations in existing path tracers.

37.A Loss Function for Generative Neural Networks Based on Watson's Perceptual Model ⬇️

Training Variational Autoencoders (VAEs) to generate realistic imagery requires a loss function that reflects human perception of image similarity. We propose such a loss function based on Watson's perceptual model, which computes a weighted distance in frequency space and accounts for luminance and contrast masking. We extend the model to color images, increase its robustness to translation by using the Fourier Transform, remove artifacts due to splitting the image into blocks, and make it differentiable. In experiments, VAEs trained with the new loss function generated realistic, high-quality image samples. Compared to using the Euclidean distance and the Structural Similarity Index, the images were less blurry; compared to deep neural network based losses, the new approach required less computational resources and generated images with fewer artifacts.

38.Object-Centric Learning with Slot Attention ⬇️

Learning object-centric representations of complex scenes is a promising step towards enabling efficient abstract reasoning from low-level perceptual features. Yet, most deep learning approaches learn distributed representations that do not capture the compositional properties of natural scenes. In this paper, we present the Slot Attention module, an architectural component that interfaces with perceptual representations such as the output of a convolutional neural network and produces a set of task-dependent abstract representations which we call slots. These slots are exchangeable and can bind to any object in the input by specializing through a competitive procedure over multiple rounds of attention. We empirically demonstrate that Slot Attention can extract object-centric representations that enable generalization to unseen compositions when trained on unsupervised object discovery and supervised property prediction tasks.
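
A minimal sketch of the competitive attention procedure the abstract describes: the attention softmax is taken across slots rather than inputs, so slots compete for input features, and a GRU refines the slots over several rounds. The published module also uses slots sampled from a learned distribution, layer normalization, and an MLP residual, all omitted here.

```python
import torch
import torch.nn as nn

class SlotAttention(nn.Module):
    def __init__(self, dim: int, n_slots: int, iters: int = 3):
        super().__init__()
        self.n_slots, self.iters, self.scale = n_slots, iters, dim ** -0.5
        self.q, self.k, self.v = nn.Linear(dim, dim), nn.Linear(dim, dim), nn.Linear(dim, dim)
        self.gru = nn.GRUCell(dim, dim)
        self.slots_init = nn.Parameter(torch.randn(n_slots, dim) * 0.1)

    def forward(self, inputs: torch.Tensor) -> torch.Tensor:
        """inputs: (B, N, dim) perceptual features; returns (B, n_slots, dim)."""
        B, dim = inputs.size(0), inputs.size(-1)
        k, v = self.k(inputs), self.v(inputs)
        slots = self.slots_init.expand(B, -1, -1)
        for _ in range(self.iters):
            attn = torch.einsum('bsd,bnd->bsn', self.q(slots), k) * self.scale
            attn = attn.softmax(dim=1)                             # compete across slots
            attn = attn / (attn.sum(dim=-1, keepdim=True) + 1e-8)  # weighted mean over inputs
            updates = torch.einsum('bsn,bnd->bsd', attn, v)
            slots = self.gru(updates.reshape(-1, dim),
                             slots.reshape(-1, dim)).view(B, self.n_slots, dim)
        return slots
```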

39.SAR2SAR: a self-supervised despeckling algorithm for SAR images ⬇️

Speckle reduction is a key step in many remote sensing applications. Speckle strongly affects synthetic aperture radar (SAR) images, making them difficult to analyse. Due to the difficulty of modelling the spatial correlation of speckle, a self-supervised deep learning algorithm is proposed in this paper: SAR2SAR. Multi-temporal time series are leveraged and the neural network learns to restore SAR images by looking only at noisy acquisitions. To this purpose, the recently proposed noise2noise framework is employed. The strategy to adapt it to SAR despeckling is presented, based on a compensation of temporal changes and a loss function adapted to the statistics of speckle.
A study with synthetic speckle noise is presented to compare the performance of the proposed method with other state-of-the-art filters. Then, results on real images are discussed to show the potential of the proposed algorithm. The code is made available to allow testing and reproducible research in this field.

40.An Interactive Data Visualization and Analytics Tool to Evaluate Mobility and Sociability Trends During COVID-19 ⬇️

The COVID-19 outbreak has dramatically changed travel behavior in affected cities. The C2SMART research team has been investigating the impact of COVID-19 on mobility and sociability. New York City (NYC) and Seattle, two of the U.S. cities most affected by COVID-19, were included in our initial study. An all-in-one dashboard with data mining and cloud computing capabilities was developed for interactive data analytics and visualization, to facilitate understanding of the impact of the outbreak and of corresponding policies, such as social distancing, on transportation systems. This platform is updated regularly and continues to evolve with the addition of new data, impact metrics, and visualizations to assist the public and decision-makers in making informed decisions. This paper presents the architecture of the COVID-related mobility data dashboard and preliminary mobility and sociability metrics for NYC and Seattle.

41.Orthogonal Deep Models As Defense Against Black-Box Attacks ⬇️

Deep learning has demonstrated state-of-the-art performance for a variety of challenging computer vision tasks. On one hand, this has enabled deep visual models to pave the way for a plethora of critical applications like disease prognostics and smart surveillance. On the other, deep learning has also been found vulnerable to adversarial attacks, which calls for new techniques to defend deep models against these attacks. Among the attack algorithms, black-box schemes are of serious practical concern, since they only need publicly available knowledge of the targeted model. We carefully analyze the inherent weakness of deep models in black-box settings, where the attacker may develop the attack using a model similar to the targeted model. Based on our analysis, we introduce a novel gradient regularization scheme that encourages the internal representation of a deep model to be orthogonal to that of another, even if the architectures of the two models are similar. Our unique constraint allows a model to strive for higher accuracy while maintaining near-orthogonal alignment of gradients with respect to a reference model. A detailed empirical study verifies that controlled misalignment of gradients under our orthogonality objective significantly boosts a model's robustness against transferable black-box adversarial attacks. In comparison to regular models, the orthogonal models are significantly more robust to a range of $l_p$ norm bounded perturbations. We verify the effectiveness of our technique on a variety of large-scale models.
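
One minimal instance of the gradient-misalignment idea: penalize the cosine similarity between a model's input gradients and those of a fixed reference model, pushing the two toward orthogonality. The paper's actual regularizer may be formulated differently; this is a hedged sketch of the concept.

```python
import torch
import torch.nn.functional as F

def input_grad(net, x, y, create_graph=False):
    x = x.clone().requires_grad_(True)
    loss = F.cross_entropy(net(x), y)
    return torch.autograd.grad(loss, x, create_graph=create_graph)[0].flatten(1)

def orthogonality_penalty(model, reference, x, y) -> torch.Tensor:
    g_m = input_grad(model, x, y, create_graph=True)  # differentiable w.r.t. model
    g_r = input_grad(reference, x, y).detach()        # fixed reference direction
    cos = F.cosine_similarity(g_m, g_r, dim=1)
    return (cos ** 2).mean()                          # zero when gradients are orthogonal

# total_loss = task_loss + lam * orthogonality_penalty(model, ref_model, x, y)
```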

42.Not all Failure Modes are Created Equal: Training Deep Neural Networks for Explicable (Mis)Classification ⬇️

Deep Neural Networks are often brittle on image classification tasks and are known to misclassify inputs. While these misclassifications may be inevitable, not all failure modes can be considered equal. Certain misclassifications (e.g., classifying the image of a dog as an airplane) can create surprise and result in the loss of human trust in the system. Even worse, certain errors (e.g., a person misclassified as a primate) can have societal impacts. Thus, in this work, we aim to reduce inexplicable errors. To address this challenge, we first discuss how to obtain the class-level semantics that capture humans' expectations ($M^h$) regarding which classes are semantically close and which are far away. We show that for datasets like CIFAR-10 and CIFAR-100, class-level semantics can be obtained by leveraging human subject studies (significantly less expensive than existing works) and, whenever possible, by utilizing publicly available human-curated knowledge. Second, we propose the use of Weighted Loss Functions to penalize misclassifications by the weight of their inexplicability. Finally, we show that training (or even fine-tuning) existing classifiers with the two proposed methods leads to Deep Neural Networks that have (1) comparable top-1 accuracy, an important metric in operational contexts, (2) more explicable failure modes, and (3) significantly lower cost in terms of additional human labels compared to existing work.
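
An illustrative weighted loss in the spirit described above, penalizing misclassifications in proportion to a human-derived semantic distance matrix $M^h$; the exact weighting scheme here is an assumption.

```python
import torch
import torch.nn.functional as F

def explicability_weighted_loss(logits, target, M_h):
    """M_h[i, j]: human-judged semantic distance between classes i and j (C x C tensor)."""
    probs = F.softmax(logits, dim=1)
    # expected semantic cost of the predicted distribution w.r.t. the true class
    cost = (probs * M_h[target]).sum(dim=1)
    return F.cross_entropy(logits, target) + cost.mean()
```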

43.A survey of loss functions for semantic segmentation ⬇️

Image segmentation has been an active field of research, as it has the potential to fix loopholes in healthcare and help the masses. In the past five years, various papers have proposed different objective loss functions for different cases, such as biased data, sparse segmentation, etc. In this paper, we summarize most of the well-known loss functions widely used for image segmentation and list the cases where their usage can help a model converge faster and better. Furthermore, we introduce a new log-cosh dice loss function and compare its performance on the NBFS skull-stripping dataset against widely used loss functions. We show that certain loss functions perform well across all datasets and can be taken as a good choice for datasets of unknown distribution. The code is available at this https URL.
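
The introduced log-cosh dice loss wraps the standard soft Dice loss in $\log(\cosh(\cdot))$, giving it a smooth, bounded-slope profile. A minimal sketch with conventional epsilon and reduction choices:

```python
import torch

def log_cosh_dice_loss(pred: torch.Tensor, target: torch.Tensor,
                       eps: float = 1e-6) -> torch.Tensor:
    """pred: probabilities in [0, 1]; target: binary mask of the same shape."""
    inter = (pred * target).sum()
    dice = 1 - (2 * inter + eps) / (pred.sum() + target.sum() + eps)  # soft Dice loss
    return torch.log(torch.cosh(dice))
```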

44.Point Proposal Network for Reconstructing 3D Particle Positions with Sub-Pixel Precision in Liquid Argon Time Projection Chambers ⬇️

Liquid Argon Time Projection Chambers (LArTPC) are particle imaging detectors that record 2D or 3D images of the numerous complex trajectories of charged particles. Identifying points of interest in these images, such as the starting and ending points of particle trajectories, is a crucial step in identifying and analyzing these particles, and it impacts the inference of physics signals such as neutrino interactions. The Point Proposal Network is designed to discover specific points of interest, namely the starting and ending points of track-like particle trajectories, such as muons and protons, and the starting points of electromagnetic-shower-like particle trajectories, such as electrons and gamma rays. The algorithm predicts their spatial location with sub-voxel precision and also determines the category of the identified points of interest. Using the public PILArNet LArTPC data sample as a benchmark, our algorithm successfully predicted 96.8%, 97.8%, and 98.1% of 3D points within voxel distances of 3, 10, and 20, respectively, from the provided true point locations. For the predicted 3D points within 3 voxels of the closest true point locations, the median distance is found to be 0.25 voxels, achieving sub-voxel precision. We report that the majority of predicted points that are more than 10 voxels away from the closest true point locations are legitimate mistakes, and that our algorithm achieved high enough accuracy to identify issues associated with a small fraction of the true point locations provided in the dataset. Further, using these predicted points, we demonstrate a set of simple algorithms to cluster 3D voxels into individual track-like particle trajectories, with a clustering efficiency, purity, and Adjusted Rand Index of 83.2%, 96.7%, and 94.7%, respectively.

45.Graph Optimal Transport for Cross-Domain Alignment ⬇️

Cross-domain alignment between two sets of entities (e.g., objects in an image, words in a sentence) is fundamental to both computer vision and natural language processing. Existing methods mainly focus on designing advanced attention mechanisms to simulate soft alignment, with no training signals to explicitly encourage alignment. The learned attention matrices are also dense and lack interpretability. We propose Graph Optimal Transport (GOT), a principled framework that germinates from recent advances in Optimal Transport (OT). In GOT, cross-domain alignment is formulated as a graph matching problem, by representing entities as a dynamically constructed graph. Two types of OT distances are considered: (i) Wasserstein distance (WD) for node (entity) matching; and (ii) Gromov-Wasserstein distance (GWD) for edge (structure) matching. Both WD and GWD can be incorporated into existing neural network models, effectively acting as a drop-in regularizer. The inferred transport plan also yields sparse and self-normalized alignment, enhancing the interpretability of the learned model. Experiments show consistent outperformance of GOT over baselines across a wide range of tasks, including image-text retrieval, visual question answering, image captioning, machine translation, and text summarization.
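
A minimal entropic-OT (Sinkhorn) sketch of the node-level Wasserstein matching described above, assuming uniform masses; used as a drop-in regularizer, the cost matrix would be built from differentiable embeddings so the loss backpropagates through it.

```python
import torch

def sinkhorn_wd(cost: torch.Tensor, n_iters: int = 50, eps: float = 0.1):
    """cost: (n, m) pairwise distances between two entity sets."""
    n, m = cost.shape
    a = torch.full((n,), 1.0 / n)       # uniform source masses
    b = torch.full((m,), 1.0 / m)       # uniform target masses
    K = torch.exp(-cost / eps)          # Gibbs kernel
    v = torch.ones(m)
    for _ in range(n_iters):
        u = a / (K @ v)
        v = b / (K.t() @ u)
    plan = u[:, None] * K * v[None, :]  # entropic transport plan
    return (plan * cost).sum(), plan
```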

46.Cascaded Convolutional Neural Networks with Perceptual Loss for Low Dose CT Denoising ⬇️

Low-dose CT denoising research aims to reduce the risks of radiation exposure to patients. Recently, researchers have used deep learning to denoise low-dose CT images with promising results. However, approaches that use mean-squared error (MSE) tend to over-smooth the image, resulting in a loss of fine structural detail in low-contrast regions of the image. These regions are often crucial for diagnosis and must be preserved in order for low-dose CT to be used effectively in practice. In this work, we use a cascade of two neural networks: the first aims to reconstruct normal-dose CT from low-dose CT by minimizing a perceptual loss, and the second predicts the difference between the ground truth and the prediction of the perceptual-loss network. We show that our method outperforms related works and more effectively reconstructs fine structural details in low-contrast regions of the image.

47.Deep Q-Network-Driven Catheter Segmentation in 3D US by Hybrid Constrained Semi-Supervised Learning and Dual-UNet ⬇️

Catheter segmentation in 3D ultrasound is important for computer-assisted cardiac intervention. However, a large number of labeled images is required to train a deep convolutional neural network (CNN) to segment the catheter successfully, which is expensive and time-consuming. In this paper, we propose a novel catheter segmentation approach which requires fewer annotations than supervised learning methods, yet achieves better performance. Our scheme uses deep Q-learning as a pre-localization step, which avoids voxel-level annotation and can efficiently localize the target catheter. With the detected catheter, a patch-based Dual-UNet is applied to segment the catheter in the 3D volumetric data. To train the Dual-UNet with limited labeled images and to leverage the information in unlabeled images, we propose a novel semi-supervised scheme which exploits unlabeled images using hybrid constraints derived from predictions. Experiments show that the proposed scheme achieves higher performance than state-of-the-art semi-supervised methods, and demonstrate that our method is able to learn from large-scale unlabeled images.

48.Can 3D Adversarial Logos Cloak Humans? ⬇️

With the trend of adversarial attacks, researchers attempt to fool trained object detectors in 2D scenes. Among many of these attacks, an intriguing new form with potential real-world usage is to append adversarial patches (e.g., logos) to images. Nevertheless, much less is known about adversarial attacks from 3D rendering views, which is essential for the attack to be persistently strong in the physical world. This paper presents a new 3D adversarial logo attack: we construct an arbitrarily shaped logo from a 2D texture image and map this image onto a 3D adversarial logo via a texture mapping called logo transformation. The resulting 3D adversarial logo is then viewed as an adversarial texture, enabling easy manipulation of its shape and position. This greatly extends the versatility of adversarial training for computer graphics synthesized imagery. Contrary to the traditional adversarial patch, this new form of attack is mapped into the 3D object world and back-propagates to the 2D image domain through differentiable rendering. In addition, and unlike existing adversarial patches, our new 3D adversarial logo is shown to fool state-of-the-art deep object detectors robustly under model rotations, taking one step further toward realistic attacks in the physical world. Our code is available at this https URL.