ArXiv cs.CV -- Thu, 9 Apr 2020

1.Slicing and dicing soccer: automatic detection of complex events from spatio-temporal data ⬇️

The automatic detection of events in sport videos has important applications for data analytics, as well as for broadcasting and media companies. This paper presents a comprehensive approach for detecting a wide range of complex events in soccer videos starting from positional data. The event detector is designed as a two-tier system that detects atomic and complex events. Atomic events are detected based on temporal and logical combinations of the detected objects, their relative distances, as well as spatio-temporal features such as velocity and acceleration. Complex events are defined as temporal and logical combinations of atomic and complex events, and are expressed by means of a declarative Interval Temporal Logic (ITL). The effectiveness of the proposed approach is demonstrated over 16 different events, including complex situations such as tackles and filtering passes. By formalizing events based on principled ITL, it is possible to easily perform reasoning tasks, such as understanding which passes or crosses result in a goal being scored. To counterbalance the lack of suitable, annotated public datasets, we built on an open source soccer simulation engine to release the synthetic SoccER (Soccer Event Recognition) dataset, which includes complete positional data and annotations for more than 1.6 million atomic events and 9,000 complex events. The dataset and code are available at this https URL
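To make the two-tier idea concrete, here is a minimal sketch of an atomic-event detector over positional data. The (x, y) track format, the frame rate, and the "kick" rule with its thresholds are illustrative assumptions, not the paper's actual definitions:

```python
import numpy as np

def kinematics(track, fps=25.0):
    """Finite-difference speed and acceleration from an (N, 2) array of (x, y) positions."""
    velocity = np.gradient(track, axis=0) * fps       # per-axis velocity in units/s
    speed = np.linalg.norm(velocity, axis=1)
    acceleration = np.gradient(speed) * fps           # scalar rate of change of speed
    return speed, acceleration

def detect_kicks(player_track, ball_track, fps=25.0, max_dist=1.0, min_ball_accel=15.0):
    """Toy atomic-event detector: flag frames where the player is near the ball
    and the ball's acceleration spikes. Thresholds are purely illustrative."""
    _, ball_accel = kinematics(ball_track, fps)
    dist = np.linalg.norm(player_track - ball_track, axis=1)
    return np.where((dist < max_dist) & (ball_accel > min_ball_accel))[0]
```

Complex events would then be composed from such atomic detections via ITL-style temporal and logical operators.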

2.Self-Supervised Monocular Scene Flow Estimation ⬇️

Scene flow estimation has been receiving increasing attention for 3D environment perception. Monocular scene flow estimation -- obtaining 3D structure and 3D motion from two temporally consecutive images -- is a highly ill-posed problem, and practical solutions are lacking to date. We propose a novel monocular scene flow method that yields competitive accuracy and real-time performance. By taking an inverse problem view, we design a single convolutional neural network (CNN) that successfully estimates depth and 3D motion simultaneously from a classical optical flow cost volume. We adopt self-supervised learning with 3D loss functions and occlusion reasoning to leverage unlabeled data. We validate our design choices, including the proxy loss and augmentation setup. Our model achieves state-of-the-art accuracy among unsupervised/self-supervised learning approaches to monocular scene flow, and yields competitive results for the optical flow and monocular depth estimation sub-tasks. Semi-supervised fine-tuning further improves the accuracy and yields promising results in real-time.

3.Skin Diseases Detection using LBP and WLD - An Ensembling Approach ⬇️

In developing and developed countries alike, skin diseases are becoming a very frequent health problem for people of all age groups. Skin problems affect mental health, can lead to addiction to alcohol and drugs, and sometimes cause social isolation. Given this importance, we propose an automatic technique to detect three common skin diseases (Leprosy, Tinea versicolor, and Vitiligo) from images of skin lesions. The proposed technique uses the Weber local descriptor and the Local binary pattern to represent the texture pattern of the affected skin regions. This ensemble technique achieved 91.38% accuracy using a multi-level support vector machine classifier, where features are extracted from different regions defined relative to the center of gravity. We also applied several popular deep learning networks such as MobileNet, ResNet_152, GoogLeNet, DenseNet_121, and ResNet_101, of which ResNet_101 performed best with 89% accuracy. The ensemble approach clearly outperforms all of the deep learning networks used. This imaging tool will be useful for early skin disease screening.
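As a hedged illustration of the texture half of this pipeline, the sketch below computes a Local Binary Pattern histogram with scikit-image; the P/R parameters and the "uniform" variant are assumptions, and the WLD half is omitted:

```python
import numpy as np
from skimage.feature import local_binary_pattern

def lbp_histogram(gray_patch, n_points=8, radius=1):
    """Uniform LBP histogram for one skin-lesion patch (2D grayscale array).
    The 'uniform' method yields n_points + 2 distinct codes."""
    codes = local_binary_pattern(gray_patch, n_points, radius, method="uniform")
    n_bins = n_points + 2
    hist, _ = np.histogram(codes, bins=n_bins, range=(0, n_bins), density=True)
    return hist  # fixed-length texture descriptor, e.g. input to an SVM
```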

4.Weakly Supervised Semantic Point Cloud Segmentation: Towards 10X Fewer Labels ⬇️

Point cloud analysis has received much attention recently, and segmentation is one of its most important tasks. The success of existing approaches is attributed to deep network design and large amounts of labelled training data, where the latter is assumed to be always available. However, obtaining 3D point cloud segmentation labels is often very costly in practice. In this work, we propose a weakly supervised point cloud segmentation approach which requires only a tiny fraction of points to be labelled in the training stage. This is made possible by learning a gradient approximation and by exploiting additional spatial and color smoothness constraints. Experiments are done on three public datasets with different degrees of weak supervision. In particular, our proposed method can produce results that are close to, and sometimes even better than, its fully supervised counterpart with 10$\times$ fewer labels.

5.Beyond Photometric Consistency: Gradient-based Dissimilarity for Improving Visual Odometry and Stereo Matching ⬇️

Pose estimation and map building are central ingredients of autonomous robots and typically rely on the registration of sensor data. In this paper, we investigate a new metric for registering images that builds upon the idea of the photometric error. Our approach combines a gradient orientation-based metric with a magnitude-dependent scaling term. We integrate both into stereo estimation as well as visual odometry systems and show clear benefits for typical disparity and direct image registration tasks when using our proposed metric. Our experimental evaluation indicates that our metric leads to more robust and more accurate estimates of the scene depth as well as the camera trajectory. Thus, the metric improves camera pose estimation and in turn the mapping capabilities of mobile robots. We believe that a series of existing visual odometry and visual SLAM systems can benefit from the findings reported in this paper.
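A rough sketch of what such a metric can look like; the magnitude weighting below is an assumption, not the paper's exact formulation. It compares gradient orientations pixel-wise and scales the penalty so that low-texture regions contribute little:

```python
import numpy as np

def gradient_dissimilarity(img_a, img_b, eps=1e-6):
    """Toy gradient-based dissimilarity: per-pixel angular difference between
    image gradients, weighted by a magnitude-dependent term. Illustrative only."""
    gy_a, gx_a = np.gradient(img_a.astype(float))
    gy_b, gx_b = np.gradient(img_b.astype(float))
    mag_a = np.hypot(gx_a, gy_a)
    mag_b = np.hypot(gx_b, gy_b)
    # cosine of the angle between the two gradient vectors at each pixel
    cos_angle = (gx_a * gx_b + gy_a * gy_b) / (mag_a * mag_b + eps)
    angular_cost = 1.0 - cos_angle                 # 0 when aligned, 2 when opposite
    weight = np.minimum(mag_a, mag_b)              # down-weight low-texture pixels
    return np.sum(weight * angular_cost) / (np.sum(weight) + eps)
```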

6.Satellite-based Prediction of Forage Conditions for Livestock in Northern Kenya ⬇️

This paper introduces the first dataset of satellite images labeled with forage quality by on-the-ground experts and provides proof of concept for applying computer vision methods to index-based drought insurance. We also present the results of a collaborative benchmark tool used to crowdsource an accurate machine learning model on the dataset. Our methods significantly outperform the existing technology for an insurance program in Northern Kenya, suggesting that a computer vision-based approach could substantially benefit pastoralists, whose exposure to droughts is severe and worsening with climate change.

7.Convolutional neural net face recognition works in non-human-like ways ⬇️

Convolutional neural networks (CNNs) give state-of-the-art performance in many pattern recognition problems but can be fooled by carefully crafted patterns of noise. We report that CNN face recognition systems also make surprising "errors". We tested six commercial face recognition CNNs and found that they outperform typical human participants on standard face matching tasks. However, they also declare matches that humans would not, where one image from the pair has been transformed to look like a different sex or race. This is not due to poor performance; the best CNNs perform almost perfectly on the human face matching tasks, yet also declare the most matches for faces of a different apparent race or sex. Although differing on the salience of sex and race, humans and computer systems are not working in completely different ways. They tend to find the same pairs of images difficult, suggesting some agreement about the underlying similarity space.

8.A Deep Learning Approach for Determining Effects of Tuta Absoluta in Tomato Plants ⬇️

Early quantification of the Tuta absoluta pest's effects in tomato plants is a very important factor in controlling and preventing serious damage from the pest. The invasion of Tuta absoluta is considered a major threat to tomato production, causing losses ranging from 80 to 100 percent when not properly managed. Therefore, real-time and early quantification of the tomato leaf miner Tuta absoluta can play an important role in pest management and enhance farmers' decisions. In this study, we propose a Convolutional Neural Network (CNN) approach for determining the effects of Tuta absoluta in tomato plants. Four pre-trained CNN architectures (VGG16, VGG19, ResNet and Inception-V3) were used to train classifiers on a dataset containing healthy and infested tomato leaves collected from real field experiments. Among the pre-trained architectures, experimental results showed that Inception-V3 yielded the best results, with an average accuracy of 87.2 percent in estimating the severity status of Tuta absoluta in tomato plants. The pre-trained models could also more easily identify the High Tuta severity status compared to the other severity statuses (Low Tuta and No Tuta).
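A minimal PyTorch sketch of this kind of transfer-learning setup, assuming torchvision and a three-class severity head (No/Low/High Tuta, per the abstract); the input pipeline and hyperparameters are omitted or assumed:

```python
import torch
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 3  # assumed: No Tuta, Low Tuta, High Tuta

model = models.inception_v3(weights="IMAGENET1K_V1")  # ImageNet pre-trained backbone
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)  # replace the classifier head
model.AuxLogits.fc = nn.Linear(model.AuxLogits.fc.in_features, NUM_CLASSES)

# Inception-V3 expects 299x299 inputs; the auxiliary head is active in train mode.
x = torch.randn(4, 3, 299, 299)
model.train()
main_logits, aux_logits = model(x)
loss = nn.CrossEntropyLoss()(main_logits, torch.tensor([0, 1, 2, 2]))
```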

9.Multi-Person Absolute 3D Human Pose Estimation with Weak Depth Supervision ⬇️

In 3D human pose estimation one of the biggest problems is the lack of large, diverse datasets. This is especially true for multi-person 3D pose estimation, where, to our knowledge, only machine-generated annotations are available for training. To mitigate this issue, we introduce a network that can be trained with additional RGB-D images in a weakly supervised fashion. Thanks to cheap sensors, videos with depth maps are widely available, and our method can exploit a large, unannotated dataset. Our algorithm is a monocular, multi-person, absolute pose estimator. We evaluate the algorithm on several benchmarks, showing a consistent improvement in error rates. Also, our model achieves state-of-the-art results on the MuPoTS-3D dataset by a considerable margin.

10.Learning 3D Semantic Scene Graphs from 3D Indoor Reconstructions ⬇️

Scene understanding has been of high interest in computer vision. It encompasses not only identifying objects in a scene, but also their relationships within the given context. With this goal, a recent line of works tackles 3D semantic segmentation and scene layout prediction. In our work we focus on scene graphs, a data structure that organizes the entities of a scene in a graph, where objects are nodes and their relationships are modeled as edges. We leverage inference on scene graphs as a way to carry out 3D scene understanding, mapping objects and their relationships. In particular, we propose a learned method that regresses a scene graph from the point cloud of a scene. Our novel architecture is based on PointNet and Graph Convolutional Networks (GCN). In addition, we introduce 3DSSG, a semi-automatically generated dataset that contains semantically rich scene graphs of 3D scenes. We show the application of our method in a domain-agnostic retrieval task, where graphs serve as an intermediate representation for 3D-3D and 2D-3D matching.

11.Adversary Helps: Gradient-based Device-Free Domain-Independent Gesture Recognition ⬇️

Wireless signal-based gesture recognition has driven developments in VR gaming, smart homes, and related applications. However, traditional approaches suffer from the domain gap: recognition accuracy drops when the recognition model is trained in one domain but used in another. Although solutions such as adversarial learning, transfer learning and body-coordinate velocity profiles have been proposed to achieve cross-domain recognition, each of these solutions has drawbacks. In this paper, we define the concept of the domain gap and then propose a more promising solution, namely DI, to eliminate the domain gap and achieve domain-independent gesture recognition. DI leverages the sign map of the gradient map as the domain gap eliminator to improve recognition accuracy. We conduct experiments with ten domains and ten gestures. The experimental results show that DI achieves recognition accuracies of 87.13%, 90.12% and 94.45% with KNN, SVM and CNN classifiers respectively, outperforming existing solutions.
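The core transform is easy to state; below is a one-function sketch of the "sign map of the gradient map" idea as we read it from the abstract (the axis along which the gradient is taken is an assumption):

```python
import numpy as np

def domain_independent_feature(signal_map):
    """Toy version of the 'sign map of the gradient map' idea: keep only the
    direction of local change, discarding domain-specific magnitudes."""
    gradient = np.gradient(signal_map, axis=-1)  # e.g. over the time axis (assumed)
    return np.sign(gradient)                     # values in {-1, 0, +1}
```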

12.Improved YOLOv3 Object Classification in Intelligent Transportation System ⬇️

Vehicle and driver detection in Intelligent Transportation Systems (ITS) has been a hot topic in recent years. In particular, driver detection is still a challenging problem, one which is conducive to supervising traffic order and maintaining public safety. In this paper, an algorithm based on YOLOv3 is proposed to detect and classify vehicles, drivers, and people on the highway, in order to distinguish drivers from passengers and form a one-to-one correspondence between vehicles and drivers. The proposed model and contrast experiments are conducted on our self-built traffic driver's face database. The effectiveness of our proposed algorithm is validated by extensive experiments and verified under various complex highway conditions. Compared with other advanced vehicle and driver detection technologies, the model performs well and is robust to road blocking, different poses, and extreme lighting.

13.Constrained Multi-shape Evolution for Overlapping Cytoplasm Segmentation ⬇️

Segmenting overlapping cytoplasm of cells in cervical smear images is a clinically essential task, for quantitatively measuring cell-level features in order to diagnose cervical cancer. This task, however, remains rather challenging, mainly due to the deficiency of intensity (or color) information in the overlapping region. Although shape prior-based models that compensate for the intensity deficiency by introducing prior shape information (shape priors) about cytoplasm are firmly established, they often yield visually implausible results, mainly because they model shape priors only by limited shape hypotheses about cytoplasm, exploit cytoplasm-level shape priors alone, and impose no shape constraint on the resulting shape of the cytoplasm. In this paper, we present a novel and effective shape prior-based approach, called constrained multi-shape evolution, that segments all overlapping cytoplasms in the clump simultaneously by jointly evolving each cytoplasm's shape guided by the modeled shape priors. We model local shape priors (cytoplasm-level) by an infinitely large shape hypothesis set which contains all possible shapes of the cytoplasm. In the shape evolution, we compensate for the intensity deficiency by introducing not only the modeled local shape priors but also global shape priors (clump-level) modeled by considering mutual shape constraints of cytoplasms in the clump. We also constrain the resulting shape in each evolution to be in the built shape hypothesis set, further reducing implausible segmentation results. We evaluated the proposed method on two typical cervical smear datasets, and the extensive experimental results show that the proposed method is effective for segmenting overlapping cytoplasm, consistently outperforming the state-of-the-art methods.

14.CNN in CT Image Segmentation: Beyond Loss Function for Exploiting Ground Truth Images ⬇️

Exploiting more information from ground truth (GT) images is a new research direction for further improving CNNs' performance in CT image segmentation. Previous methods focus on devising the loss function to fulfil this purpose. However, it is rather difficult to devise a general and optimization-friendly loss function. We here present a novel and practical method that exploits GT images beyond the loss function. Our insight is that the feature maps of two CNNs trained respectively on GT and CT images should be similar on some metric space, because both are used to describe the same objects for the same purpose. We hence exploit GT images by enforcing the two CNNs' feature maps to be consistent. We assess the proposed method on two datasets, and compare its performance to several competitive methods. Extensive experimental results show that the proposed method is effective, outperforming all the compared methods.
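A hedged sketch of this consistency idea in PyTorch; the choice of cosine distance after flattening is an assumption, not the paper's metric:

```python
import torch
import torch.nn.functional as F

def feature_consistency_loss(ct_features, gt_features):
    """Encourage feature maps of a CNN trained on CT images to agree with those
    of a CNN trained on ground-truth masks. Both tensors: (batch, C, H, W).
    The cosine-distance metric here is an assumption."""
    ct = F.normalize(ct_features.flatten(1), dim=1)
    gt = F.normalize(gt_features.flatten(1), dim=1)
    return (1.0 - (ct * gt).sum(dim=1)).mean()

# e.g. total = segmentation_loss + lam * feature_consistency_loss(f_ct, f_gt)
```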

15.Monte-Carlo Siamese Policy on Actor for Satellite Image Super Resolution ⬇️

In the past few years, supervised and adversarial learning have been widely adopted in various complex computer vision tasks. It seems natural to wonder whether another branch of artificial intelligence, commonly known as Reinforcement Learning (RL), can benefit such complex vision tasks. In this study, we explore the plausible usage of RL in the super resolution of remote sensing imagery. Guided by recent advances in super resolution, we propose a theoretical framework that leverages the benefits of supervised and reinforcement learning. We argue that a straightforward implementation of RL is not adequate to address ill-posed super resolution, as the action variables are not fully known. To tackle this issue, we propose to parameterize action variables by matrices, and train our policy network using Monte-Carlo sampling. We study the implications of a parametric action space in a model-free environment from theoretical and empirical perspectives. Furthermore, we analyze the quantitative and qualitative results on both remote sensing and non-remote sensing datasets. Based on our experiments, we report considerable improvement over state-of-the-art methods by encapsulating supervised models in a reinforcement learning framework.

16.A Robust Method for Image Stitching ⬇️

We propose a novel method for image stitching that is robust against repetitive patterns and featureless regions in the imagery. In such cases, typical image stitching methods easily produce stitching artifacts, since they may produce false pairwise image registrations that are in conflict within the global connectivity graph. By contrast, our method collects all the plausible pairwise image registration candidates, among which globally consistent candidates are chosen. This enables the method to determine the correct pairwise registrations by utilizing all the available information from the whole imagery, such as unambiguous registrations outside the repeating patterns and featureless regions. We formalize the method as a weighted multigraph whose nodes represent the individual image transformations from the composite image, and whose sets of multiple edges between two nodes represent all the plausible transformations between the pixel coordinates of the two images. The edge weights represent the plausibility of the transformations. The image transformations and the edge weights are solved from a non-linear minimization problem with linear constraints, for which a projection method is used. As an example, we apply the method in a scanning application where the transformations are primarily translations with only slight rotation and scaling components.

17.MNIST-MIX: A Multi-language Handwritten Digit Recognition Dataset ⬇️

In this letter, we contribute a multi-language handwritten digit recognition dataset named MNIST-MIX, which is the largest dataset of its type in terms of both languages and data samples. With the same data format as MNIST, MNIST-MIX can be seamlessly applied in existing studies of handwritten digit recognition. By introducing digits from 10 different languages, MNIST-MIX becomes a more challenging dataset, and its imbalanced classification requires a better design of models. We also present the results of applying a LeNet model pre-trained on MNIST as the baseline.
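For readers wanting a concrete baseline, here is a minimal LeNet-5-style model in PyTorch; the exact layer sizes and the class-head convention for MNIST-MIX are assumptions, since the abstract does not specify them:

```python
import torch
import torch.nn as nn

class LeNet5(nn.Module):
    """Classic LeNet-5-style baseline for 28x28 grayscale digits. For MNIST-MIX
    the head might be 100-way (10 digits x 10 languages) or a shared 10-way
    head, depending on the protocol -- an assumption here."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(6, 16, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 5 * 5, 120), nn.ReLU(),
            nn.Linear(120, 84), nn.ReLU(),
            nn.Linear(84, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

logits = LeNet5()(torch.randn(8, 1, 28, 28))  # -> shape (8, 10)
```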

18.Multi-Head Attention-based Probabilistic Vehicle Trajectory Prediction ⬇️

This paper presents an online-capable deep learning model for probabilistic vehicle trajectory prediction. We propose a simple encoder-decoder architecture based on multi-head attention. The proposed model generates the distribution of the predicted trajectories for multiple vehicles in parallel. Our approach to modeling the interactions learns to attend to a few influential vehicles in an unsupervised manner, which can improve the interpretability of the network. Experiments using naturalistic highway trajectories show a clear improvement in positional error in both the longitudinal and lateral directions.
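A minimal sketch of attention over neighbouring vehicles (not the paper's exact architecture): each vehicle's encoded history attends to all the others, and the attention weights expose which neighbours influence the prediction:

```python
import torch
import torch.nn as nn

embed_dim, num_heads, num_vehicles, batch = 64, 4, 6, 32  # assumed sizes

attention = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
vehicle_encodings = torch.randn(batch, num_vehicles, embed_dim)  # from some encoder

context, weights = attention(vehicle_encodings, vehicle_encodings, vehicle_encodings)
# `weights` has shape (batch, num_vehicles, num_vehicles); inspecting it shows
# which neighbours each vehicle attends to -- the interpretability the abstract mentions.
```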

19.Change Detection in Heterogeneous Optical and SAR Remote Sensing Images via Deep Homogeneous Feature Fusion ⬇️

Change detection in heterogeneous remote sensing images is crucial for disaster damage assessment. Recent methods use homogeneous transformation, which transforms the heterogeneous optical and SAR remote sensing images into the same feature space, to achieve change detection. Such transformations mainly operate on the low-level feature space and may corrupt the semantic content, deteriorating the performance of change detection. To solve this problem, this paper presents a new homogeneous transformation model termed deep homogeneous feature fusion (DHFF) based on image style transfer (IST). Unlike the existing methods, the DHFF method segregates the semantic content and the style features in the heterogeneous images to perform homogeneous transformation. The separation of the semantic content and the style in homogeneous transformation prevents the corruption of image semantic content, especially in the regions of change. In this way, the detection performance is improved with accurate homogeneous transformation. Furthermore, we present a new iterative IST (IIST) strategy, where the cost function in each IST iteration measures and thus maximizes the feature homogeneity in additional new feature subspaces for change detection. After that, change detection is accomplished accurately on the original and the transformed images that are in the same feature space. Real remote sensing images acquired by SAR and optical satellites are utilized to evaluate the performance of the proposed method. The experiments demonstrate that the proposed DHFF method achieves significant improvement for change detection in heterogeneous optical and SAR remote sensing images, in terms of both accuracy rate and Kappa index.

20.Attentive Normalization for Conditional Image Generation ⬇️

Traditional convolution-based generative adversarial networks synthesize images based on hierarchical local operations, where the long-range dependency relation is implicitly modeled with a Markov chain. This is still not sufficient for categories with complicated structures. In this paper, we characterize long-range dependence with attentive normalization (AN), an extension of traditional instance normalization. Specifically, the input feature map is softly divided into several regions based on its internal semantic similarity, and each region is normalized separately. This enhances consistency between distant regions with semantic correspondence. Compared with self-attention GANs, our attentive normalization does not need to measure the correlation of all locations, and thus can be directly applied to large-size feature maps without much computational burden. Extensive experiments on class-conditional image generation and semantic inpainting verify the efficacy of our proposed module.

21.Feature Re-Learning with Data Augmentation for Video Relevance Prediction ⬇️

Predicting the relevance between two given videos with respect to their visual content is a key component for content-based video recommendation and retrieval. Thanks to the increasing availability of pre-trained image and video convolutional neural network models, deep visual features are widely used for video content representation. However, as how two videos are relevant is task-dependent, such off-the-shelf features are not always optimal for all tasks. Moreover, due to varied concerns including copyright, privacy and security, one might have access to only pre-computed video features rather than original videos. We propose in this paper feature re-learning for improving video relevance prediction, with no need of revisiting the original video content. In particular, re-learning is realized by projecting a given deep feature into a new space by an affine transformation. We optimize the re-learning process by a novel negative-enhanced triplet ranking loss. In order to generate more training data, we propose a new data augmentation strategy which works directly on frame-level and video-level features. Extensive experiments in the context of the Hulu Content-based Video Relevance Prediction Challenge 2018 justify the effectiveness of the proposed method and its state-of-the-art performance for content-based video relevance prediction.
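A hedged sketch of the two moving parts named here: the affine re-learning map and a plain triplet ranking loss. The paper's "negative-enhanced" variant modifies the standard form below, and the feature dimensions are assumptions:

```python
import torch
import torch.nn.functional as F

def triplet_ranking_loss(anchor, positive, negative, margin=0.2):
    """Standard triplet ranking loss on re-learned video features; the paper's
    'negative-enhanced' variant refines this baseline. Inputs: (batch, dim)."""
    pos_sim = F.cosine_similarity(anchor, positive)
    neg_sim = F.cosine_similarity(anchor, negative)
    return F.relu(margin - pos_sim + neg_sim).mean()

# Feature re-learning itself is just an affine map y = W x + b over the
# pre-computed deep features; 2048 dimensions are an assumption.
relearn = torch.nn.Linear(2048, 2048)
```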

22.MirrorNet: A Deep Bayesian Approach to Reflective 2D Pose Estimation from Human Images ⬇️

This paper proposes a statistical approach to 2D pose estimation from human images. The main problems with the standard supervised approach, which is based on a deep recognition (image-to-pose) model, are that it often yields anatomically implausible poses, and its performance is limited by the amount of paired data. To solve these problems, we propose a semi-supervised method that can make effective use of images with and without pose annotations. Specifically, we formulate a hierarchical generative model of poses and images by integrating a deep generative model of poses from pose features with that of images from poses and image features. We then introduce a deep recognition model that infers poses from images. Given images as observed data, these models can be trained jointly in a hierarchical variational autoencoding (image-to-pose-to-feature-to-pose-to-image) manner. The results of experiments show that the proposed reflective architecture makes estimated poses anatomically plausible, and that the performance of pose estimation is improved by integrating the recognition and generative models and also by feeding in non-annotated images.

23.State of the Art on Neural Rendering ⬇️

Efficient rendering of photo-realistic virtual worlds is a long-standing goal of computer graphics. Modern graphics techniques have succeeded in synthesizing photo-realistic images from hand-crafted scene representations. However, the automatic generation of shape, materials, lighting, and other aspects of scenes remains a challenging problem that, if solved, would make photo-realistic computer graphics more widely accessible. Concurrently, progress in computer vision and machine learning has given rise to a new approach to image synthesis and editing, namely deep generative models. Neural rendering is a new and rapidly emerging field that combines generative machine learning techniques with physical knowledge from computer graphics, e.g., by the integration of differentiable rendering into network training. With a plethora of applications in computer graphics and vision, neural rendering is poised to become a new area in the graphics community, yet no survey of this emerging field exists. This state-of-the-art report summarizes the recent trends and applications of neural rendering. We focus on approaches that combine classic computer graphics techniques with deep generative models to obtain controllable and photo-realistic outputs. Starting with an overview of the underlying computer graphics and machine learning concepts, we discuss critical aspects of neural rendering approaches. This state-of-the-art report is focused on the many important use cases for the described algorithms such as novel view synthesis, semantic photo manipulation, facial and body reenactment, relighting, free-viewpoint video, and the creation of photo-realistic avatars for virtual and augmented reality telepresence. Finally, we conclude with a discussion of the social implications of such technology and investigate open research problems.

24.DMLO: Deep Matching LiDAR Odometry ⬇️

LiDAR odometry is a fundamental task for various areas such as robotics and autonomous driving. The problem is difficult since it requires systems to be highly robust when running on noisy real-world data. Existing methods are mostly local iterative methods. Feature-based global registration methods are not preferred, since extracting accurate matching pairs in nonuniform and sparse LiDAR data remains challenging. In this paper, we present Deep Matching LiDAR Odometry (DMLO), a novel learning-based framework which makes feature matching methods applicable to the LiDAR odometry task. Unlike many recent learning-based methods, DMLO explicitly enforces geometric constraints in the framework. Specifically, DMLO decomposes the 6-DoF pose estimation into two parts: a learning-based matching network which provides accurate correspondences between two scans, and rigid transformation estimation with a closed-form solution by Singular Value Decomposition (SVD). Comprehensive experimental results on the real-world KITTI and Argoverse datasets demonstrate that our DMLO dramatically outperforms existing learning-based methods and is comparable with state-of-the-art geometry-based approaches.
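The closed-form SVD step is standard and worth spelling out. A minimal NumPy sketch of the classical Kabsch/Umeyama solution for matched point pairs, which is what an SVD-based rigid fit refers to (the paper may add correspondence weighting):

```python
import numpy as np

def rigid_transform_svd(src, dst):
    """Least-squares rigid transform (rotation R, translation t) from matched
    3D point pairs, via SVD. src, dst: (N, 3) arrays with src[i] <-> dst[i],
    so that dst ~= R @ src + t after fitting."""
    src_c = src - src.mean(axis=0)
    dst_c = dst - dst.mean(axis=0)
    u, _, vt = np.linalg.svd(src_c.T @ dst_c)
    d = np.sign(np.linalg.det(vt.T @ u.T))        # guard against reflections
    r = vt.T @ np.diag([1.0, 1.0, d]) @ u.T
    t = dst.mean(axis=0) - r @ src.mean(axis=0)
    return r, t
```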

25.Learning for Scale-Arbitrary Super-Resolution from Scale-Specific Networks ⬇️

Recently, the performance of single image super-resolution (SR) has been significantly improved with powerful networks. However, these networks are developed for image SR with a single specific integer scale (e.g., x2, x3, x4), and cannot be used for non-integer and asymmetric SR. In this paper, we propose to learn a scale-arbitrary image SR network from scale-specific networks. Specifically, we propose a plug-in module for existing SR networks to perform scale-arbitrary SR, which consists of multiple scale-aware feature adaption blocks and a scale-aware upsampling layer. Moreover, we introduce a scale-aware knowledge transfer paradigm to transfer knowledge from scale-specific networks to the scale-arbitrary network. Our plug-in module can be easily adapted to existing networks to achieve scale-arbitrary SR. These networks plugged with our module can achieve promising results for non-integer and asymmetric SR while maintaining state-of-the-art performance for SR with integer scale factors. Besides, the additional computational and memory cost of our module is very small.

26.Learning to Detect Head Movement in Unconstrained Remote Gaze Estimation in the Wild ⬇️

Unconstrained remote gaze estimation remains challenging, mostly due to its vulnerability to large variability in head pose. Prior solutions struggle to maintain reliable accuracy in unconstrained remote gaze tracking. Among them, appearance-based solutions demonstrate tremendous potential for improving gaze accuracy. However, existing works still suffer from head movement and are not robust enough to handle real-world scenarios. In particular, most of them study gaze estimation under controlled scenarios, where the collected datasets cover limited ranges of both head pose and gaze, which introduces further bias. In this paper, we propose novel end-to-end appearance-based gaze estimation methods that more robustly incorporate different levels of head-pose representations into gaze estimation. Our method generalizes to real-world scenarios with low image quality and different lightings, and to scenarios where direct head-pose information is not available. To better demonstrate the advantage of our methods, we further propose a new benchmark dataset with the richest distribution of head-gaze combinations, reflecting real-world scenarios. Extensive evaluations on several public datasets and our own dataset demonstrate that our method consistently outperforms the state-of-the-art by a significant margin.

27.Mobile-Based Deep Learning Models for Banana Diseases Detection ⬇️

Smallholder farmers in Tanzania are challenged by the lack of tools for early detection of banana diseases. This study aimed at developing a mobile application for early detection of Fusarium wilt race 1 and black Sigatoka banana diseases using deep learning. We used a dataset of 3000 banana leaf images. We pre-trained our models on the Resnet152 and Inceptionv3 Convolutional Neural Network architectures. Resnet152 achieved an accuracy of 99.2% and Inceptionv3 an accuracy of 95.41%. For deployment on Android mobile phones, we chose Inceptionv3 since it has lower memory requirements than Resnet152. In a real environment, the mobile application detected the two diseases with a confidence level of 99% over the captured leaf area. This result indicates the potential to improve the yield of bananas by smallholder farmers using a tool for early detection of diseases.

28.Context-Aware Group Captioning via Self-Attention and Contrastive Features ⬇️

While image captioning has progressed rapidly, existing works focus mainly on describing single images. In this paper, we introduce a new task, context-aware group captioning, which aims to describe a group of target images in the context of another group of related reference images. Context-aware group captioning requires not only summarizing information from both the target and reference image groups but also contrasting between them. To solve this problem, we propose a framework combining a self-attention mechanism with contrastive feature construction to effectively summarize common information from each image group while capturing discriminative information between them. To build the dataset for this task, we propose to group the images and generate the group captions based on single image captions using scene graph matching. Our datasets are constructed on top of the public Conceptual Captions dataset and our new Stock Captions dataset. Experiments on the two datasets show the effectiveness of our method on this new task. The related datasets and code are released at this https URL .

29.Long-Tailed Recognition Using Class-Balanced Experts ⬇️

Classic deep learning methods achieve impressive results in image recognition over large-scale, artificially balanced datasets. However, real-world datasets exhibit highly class-imbalanced distributions. In this work we address the problem of long-tailed recognition, wherein the training set is highly imbalanced and the test set is kept balanced. The key challenges faced by any long-tailed recognition technique are the relative imbalance amongst the classes and data scarcity or unseen concepts for medium-shot or few-shot classes. Existing techniques rely on data resampling, cost-sensitive learning, online hard example mining, reshaping the loss objective and complex memory-based models to address this problem. We instead propose an ensemble-of-experts technique that decomposes the imbalanced problem into multiple balanced classification problems which are more tractable. Our ensemble of experts reaches close to state-of-the-art results, and an extended ensemble establishes a new state-of-the-art on two benchmarks for long-tailed recognition. We conduct numerous experiments to analyse the performance of the ensemble, and show that in modern datasets relative imbalance is a harder problem than data scarcity.

30.Exemplar Fine-Tuning for 3D Human Pose Fitting Towards In-the-Wild 3D Human Pose Estimation ⬇️

We propose a method for building large collections of human poses with full 3D annotations captured `in the wild', for which specialized capture equipment cannot be used. We start with a dataset with 2D keypoint annotations, such as COCO and MPII, and generate corresponding 3D poses. This is done via Exemplar Fine-Tuning (EFT), a new method to fit a 3D parametric model to 2D keypoints. EFT is accurate and can exploit a data-driven pose prior to resolve the depth reconstruction ambiguity that comes from using only 2D observations as input. We use EFT to augment these large in-the-wild datasets with plausible and accurate 3D pose annotations. We then use this data to strongly supervise a 3D pose regression network, achieving state-of-the-art results in standard benchmarks, including the ones collected outdoors. This network also achieves unprecedented 3D pose estimation quality on extremely challenging Internet videos.

31.Semantic Image Manipulation Using Scene Graphs ⬇️

Image manipulation can be considered a special case of image generation where the image to be produced is a modification of an existing image. Image generation and manipulation have been, for the most part, tasks that operate on raw pixels. However, the remarkable progress in learning rich image and object representations has opened the way for tasks such as text-to-image or layout-to-image generation that are mainly driven by semantics. In our work, we address the novel problem of image manipulation from scene graphs, in which a user can edit images by merely applying changes in the nodes or edges of a semantic graph that is generated from the image. Our goal is to encode image information in a given constellation and from there on generate new constellations, such as replacing objects or even changing relationships between objects, while respecting the semantics and style from the original image. We introduce a spatio-semantic scene graph network that does not require direct supervision for constellation changes or image edits. This makes it possible to train the system from existing real-world datasets with no additional annotation effort.

32.Radon cumulative distribution transform subspace modeling for image classification ⬇️

We present a new supervised image classification method for problems where the data at hand conform to certain deformation models applied to unknown prototypes or templates. The method makes use of the previously described Radon Cumulative Distribution Transform (R-CDT) for image data, whose mathematical properties are exploited to express the image data in a form that is more suitable for machine learning. While certain operations such as translation, scaling, and higher-order transformations are challenging to model in native image space, we show the R-CDT can capture some of these variations and thus render the associated image classification problems easier to solve. The method is simple to implement, non-iterative, has no hyper-parameters to tune, is computationally efficient, and provides accuracies competitive with state-of-the-art neural networks for many types of classification problems, especially in few-label learning settings. Furthermore, we show improvements with respect to neural network-based methods in terms of computational efficiency (it can be implemented without the use of GPUs), the number of training samples needed, and out-of-distribution generalization. The Python code for reproducing our results is available at this https URL.

33.TypeNet: Scaling up Keystroke Biometrics ⬇️

We study the suitability of keystroke dynamics for authenticating 100K users typing free text. For this, we first analyze to what extent our method, based on a Siamese Recurrent Neural Network (RNN), is able to authenticate users when the amount of data per user is scarce, a common scenario in free-text keystroke authentication. With 1K test users, a population size comparable to previous works, TypeNet obtains an equal error rate of 4.8% using only 5 enrollment sequences and 1 test sequence per user, with 50 keystrokes per sequence. Using the same amount of data per user, as the number of test users is scaled up to 100K, the performance decays by less than 5% relative to the 1K case, demonstrating the potential of TypeNet to scale well to large numbers of users. Our experiments are conducted with the Aalto University keystroke database which, to the best of our knowledge, is the largest free-text keystroke database, comprising more than 136M keystrokes from 168K users.
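For reference, the equal error rate reported here is the operating point where false-acceptance and false-rejection rates coincide; a small sketch of how it is typically computed from verification scores (the nearest-point approximation without interpolation is an assumption):

```python
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(labels, scores):
    """EER for a verification system. labels: 1 for genuine comparisons,
    0 for impostors; scores: higher means more similar."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    idx = np.nanargmin(np.abs(fpr - fnr))   # point where FAR and FRR cross
    return (fpr[idx] + fnr[idx]) / 2.0
```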

34.PatchVAE: Learning Local Latent Codes for Recognition ⬇️

Unsupervised representation learning holds the promise of exploiting large amounts of unlabeled data to learn general representations. A promising technique for unsupervised learning is the framework of Variational Auto-encoders (VAEs). However, unsupervised representations learned by VAEs are significantly outperformed by those learned by supervised learning for recognition. Our hypothesis is that to learn useful representations for recognition the model needs to be encouraged to learn about repeating and consistent patterns in data. Drawing inspiration from the mid-level representation discovery work, we propose PatchVAE, that reasons about images at patch level. Our key contribution is a bottleneck formulation that encourages mid-level style representations in the VAE framework. Our experiments demonstrate that representations learned by our method perform much better on the recognition tasks compared to those learned by vanilla VAEs.

35.JHU-CROWD++: Large-Scale Crowd Counting Dataset and A Benchmark Method ⬇️

Due to its variety of applications in the real world, the task of single image-based crowd counting has received a lot of interest in recent years. Recently, several approaches have been proposed to address various problems encountered in crowd counting. These approaches are essentially based on convolutional neural networks that require large amounts of data to train the network parameters. Considering this, we introduce a new large-scale unconstrained crowd counting dataset (JHU-CROWD++) that contains 4,372 images with 1.51 million annotations. In comparison to existing datasets, the proposed dataset is collected under a variety of diverse scenarios and environmental conditions. Specifically, the dataset includes several images with weather-based degradations and illumination variations, making it a very challenging dataset. Additionally, the dataset consists of a rich set of annotations at both the image level and the head level. Several recent methods are evaluated and compared on this dataset. The dataset can be downloaded from this http URL .
Furthermore, we propose a novel crowd counting network that progressively generates crowd density maps via residual error estimation. The proposed method uses VGG16 as the backbone network and employs the density map generated by the final layer as a coarse prediction, which is refined into finer density maps in a progressive fashion using residual learning. Additionally, the residual learning is guided by an uncertainty-based confidence weighting mechanism that permits the flow of only high-confidence residuals in the refinement path. The proposed Confidence Guided Deep Residual Counting Network (CG-DRCN) is evaluated on recent complex datasets and achieves significant reductions in error.

36.Multimodal Image Synthesis with Conditional Implicit Maximum Likelihood Estimation ⬇️

Many tasks in computer vision and graphics fall within the framework of conditional image synthesis. In recent years, generative adversarial nets (GANs) have delivered impressive advances in quality of synthesized images. However, it remains a challenge to generate both diverse and plausible images for the same input, due to the problem of mode collapse. In this paper, we develop a new generic multimodal conditional image synthesis method based on Implicit Maximum Likelihood Estimation (IMLE) and demonstrate improved multimodal image synthesis performance on two tasks, single image super-resolution and image synthesis from scene layouts. We make our implementation publicly available.
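A hedged sketch of the conditional IMLE training idea described here: for each input, draw several latent samples, and pull only the generated sample nearest to the ground truth toward it, so that every data point is matched by some sample and GAN-style mode collapse is avoided. The generator interface and its `latent_dim` attribute are assumptions:

```python
import torch

def conditional_imle_step(generator, x, y_true, num_samples=8):
    """One conditional-IMLE objective evaluation (sketch). For each input in
    the batch, keep the per-example loss of the nearest generated sample."""
    best_loss = None
    for _ in range(num_samples):
        z = torch.randn(x.size(0), generator.latent_dim)  # latent_dim: assumed attribute
        y_hat = generator(x, z)
        loss = ((y_hat - y_true) ** 2).flatten(1).sum(dim=1)
        best_loss = loss if best_loss is None else torch.minimum(best_loss, loss)
    return best_loss.mean()  # backpropagates only through the nearest samples
```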

37.Empirical Perspectives on One-Shot Semi-supervised Learning ⬇️

One of the greatest obstacles to the adoption of deep neural networks for new applications is that training the network typically requires a large number of manually labeled training samples. We empirically investigate the scenario where one has access to large amounts of unlabeled data but needs to label only a single prototypical sample per class in order to train a deep network (i.e., one-shot semi-supervised learning). Specifically, we investigate the recent results reported for FixMatch in one-shot semi-supervised learning to understand the factors that affect and impede high accuracy and reliability for one-shot semi-supervised learning on CIFAR-10. For example, we discover that one barrier to one-shot semi-supervised learning for high-performance image classification is the unevenness of class accuracy during training. These results point to solutions that might enable more widespread adoption of one-shot semi-supervised training methods for new applications.

38.CURL: Contrastive Unsupervised Representations for Reinforcement Learning ⬇️

We present CURL: Contrastive Unsupervised Representations for Reinforcement Learning. CURL extracts high-level features from raw pixels using contrastive learning and performs off-policy control on top of the extracted features. CURL outperforms prior pixel-based methods, both model-based and model-free, on complex tasks in the DeepMind Control Suite and Atari Games showing 2.8x and 1.6x performance gains respectively at the 100K interaction steps benchmark. On the DeepMind Control Suite, CURL is the first image-based algorithm to nearly match the sample-efficiency and performance of methods that use state-based features.
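Roughly, CURL's contrastive objective scores augmented views of the same observation against the rest of the batch. Below is a minimal sketch of that instance-discrimination loss with the bilinear similarity CURL uses; the initialization and the absence of a momentum key encoder are simplifying assumptions:

```python
import torch
import torch.nn as nn

class BilinearContrastiveLoss(nn.Module):
    """CURL-style contrastive loss (sketch): score query features against key
    features with a learned bilinear product, then apply cross-entropy over the
    batch, where the matching pair sits on the diagonal."""
    def __init__(self, feature_dim):
        super().__init__()
        self.W = nn.Parameter(torch.randn(feature_dim, feature_dim) * 0.01)

    def forward(self, z_query, z_key):
        logits = z_query @ self.W @ z_key.t()                      # (batch, batch)
        logits = logits - logits.max(dim=1, keepdim=True).values   # numerical stability
        labels = torch.arange(z_query.size(0))                     # positives on diagonal
        return nn.functional.cross_entropy(logits, labels)
```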

39.Time accelerated image super-resolution using shallow residual feature representative network ⬇️

The recent advances in deep learning indicate significant progress in the field of single image super-resolution. With the advent of these techniques, high-resolution images with a high peak signal-to-noise ratio (PSNR) and excellent perceptual quality can be reconstructed. The major challenges associated with existing deep convolutional neural networks are their computational complexity and time; the increasing depth of the networks often results in high space complexity. To alleviate these issues, we developed an innovative shallow residual feature representative network (SRFRN) that uses a bicubic interpolated low-resolution image as input, together with residual feature representative (RFR) units comprising serially stacked residual non-linear convolutions. Furthermore, the high-resolution image is reconstructed by combining the outputs of the RFR units with the residual output from the bicubic interpolated LR image. Finally, multiple experiments have been performed on the benchmark datasets, and the proposed model shows superior performance at higher scales. Besides, this model also exhibits faster execution time compared to all the existing approaches.

40.A Polynomial Neural Network with Controllable Precision and Human-Readable Topology for Prediction and System Identification ⬇️

Despite the success of artificial neural networks (ANNs), there is still concern among many over their "black box" nature. Why do they work? Could we design a "transparent" network? This paper presents a controllable and readable polynomial neural network (CR-PNN) for approximation, prediction, and system identification. CR-PNN is simple enough to be described as one "small" formula, so that we can control the approximation precision and explain the internal structure of the network. CR-PNN is, in essence, a Taylor expansion in the form of a network. The number of layers represents the precision, and the derivatives in the Taylor expansion are imitated exactly by the error back-propagation algorithm. Firstly, we demonstrate that CR-PNN shows excellent analysis performance on "black box" systems through ten synthetic datasets with noise; the results were also compared against the synthetic data to substantiate its search towards the global optimum. Secondly, it was verified on ten real-world applications that CR-PNN brings better generalization capability than typical ANNs, whose approximation depends on nonlinear activation functions. Finally, 200,000 repeated experiments with 4,898 samples demonstrated that CR-PNN is five times more efficient than a typical ANN for one epoch and ten times more efficient for one forward propagation. In short, compared with traditional neural networks, the novelties and advantages of CR-PNN include readability of the internal structure, guarantees of a globally optimal solution, lower computational complexity, and likely better robustness to real-world approximation. (We're strong believers in Open Source, and provide CR-PNN code for others. GitHub: this https URL)

41.Image super-resolution reconstruction based on attention mechanism and feature fusion ⬇️

To address the problems that convolutional neural networks for image super-resolution reconstruction neglect the inherent attributes of natural images and extract features at only a single scale, a network structure based on an attention mechanism and multi-scale feature fusion is proposed. By using the attention mechanism, the network can effectively integrate non-local information and second-order features of the image, improving the feature expression ability of the network. At the same time, convolution kernels of different scales are used to extract multi-scale information from the image, preserving complete information characteristics at different scales. Experimental results show that the proposed method achieves better performance than other representative super-resolution reconstruction algorithms in objective quantitative metrics and visual quality.

42.Deep Adaptive Inference Networks for Single Image Super-Resolution ⬇️

Recent years have witnessed tremendous progress in single image super-resolution (SISR) owing to the deployment of deep convolutional neural networks (CNNs). For most existing methods, the computational cost of each SISR model is independent of local image content, hardware platform and application scenario. Nonetheless, a content- and resource-adaptive model is preferable, and it is encouraging to apply simpler, more efficient networks to easier regions with fewer details and to scenarios with restricted efficiency constraints. In this paper, we take a step towards this goal by leveraging adaptive inference networks for deep SISR (AdaDSR). In particular, our AdaDSR involves an SISR model as the backbone and a lightweight adapter module which takes image features and a resource constraint as input and predicts a map of local network depth. Adaptive inference can then be performed with the support of efficient sparse convolution, where only a fraction of the layers in the backbone is executed at a given position according to its predicted depth. The network learning can be formulated as the joint optimization of reconstruction and network depth losses. In the inference stage, the average depth can be flexibly tuned to meet a range of efficiency constraints. Experiments demonstrate the effectiveness and adaptability of our AdaDSR in contrast to its counterparts (e.g., EDSR and RCAN).

43.Training Neural Networks to Produce Compatible Features ⬇️

This paper makes a first step towards compatible and hence reusable network components. Rather than training networks for different tasks independently, we adapt the training process to produce network components that are compatible across tasks. We propose and compare several different approaches to accomplish compatibility. Our experiments on CIFAR-10 show that: (i) we can train networks to produce compatible features, without degrading task accuracy compared to training networks independently; (ii) the degree of compatibility is highly dependent on where we split the network into a feature extractor and a classification head; (iii) random initialization has a large effect on compatibility; (iv) we can train incrementally: given previously trained components, we can train new ones which are also compatible with them. This work is part of a larger goal to increase network reusability: we envision that compatibility will enable solving new tasks by mixing and matching suitable components.

44.Normalizing Flows with Multi-Scale Autoregressive Priors ⬇️

Flow-based generative models are an important class of exact inference models that admit efficient inference and sampling for image synthesis. Owing to the efficiency constraints on the design of the flow layers, e.g. split coupling flow layers in which approximately half the pixels do not undergo further transformations, they have limited expressiveness for modeling long-range data dependencies compared to autoregressive models that rely on conditional pixel-wise generation. In this work, we improve the representational power of flow-based models by introducing channel-wise dependencies in their latent space through multi-scale autoregressive priors (mAR). Our mAR prior for models with split coupling flow layers (mAR-SCF) can better capture dependencies in complex multimodal data. The resulting model achieves state-of-the-art density estimation results on MNIST, CIFAR-10, and ImageNet. Furthermore, we show that mAR-SCF allows for improved image generation quality, with gains in FID and Inception scores compared to state-of-the-art flow-based models.

45.S2A: Wasserstein GAN with Spatio-Spectral Laplacian Attention for Multi-Spectral Band Synthesis ⬇️

The intersection of adversarial learning and satellite image processing is an emerging field in remote sensing. In this study, we address the synthesis of high-resolution multi-spectral satellite imagery using adversarial learning. Guided by the discovery of the attention mechanism, we regulate the process of band synthesis through spatio-spectral Laplacian attention. Further, we use a Wasserstein GAN with a gradient penalty norm to improve the training and stability of adversarial learning. In this regard, we introduce a new cost function for the discriminator based on spatial attention and a domain adaptation loss. We critically analyze the qualitative and quantitative results compared with state-of-the-art methods using widely adopted evaluation metrics. Our experiments on datasets from three different sensors, namely LISS-3, LISS-4, and WorldView-2, show that attention learning performs favorably against state-of-the-art methods. Using the proposed method we provide an additional data product consistent with existing high-resolution bands. Furthermore, we synthesize over 4000 high-resolution scenes covering various terrains to analyze scientific fidelity. Finally, we demonstrate plausible large-scale real-world applications of the synthesized band.
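The gradient penalty term is standard WGAN-GP machinery; here is a short PyTorch sketch for reference. The penalty weight of 10 follows the original WGAN-GP paper and is not necessarily this paper's choice:

```python
import torch

def gradient_penalty(discriminator, real, fake, device="cpu"):
    """WGAN-GP penalty (sketch): push the discriminator's gradient norm toward 1
    on random interpolates between real and generated image batches (B, C, H, W)."""
    alpha = torch.rand(real.size(0), 1, 1, 1, device=device)
    interpolates = (alpha * real + (1 - alpha) * fake).requires_grad_(True)
    scores = discriminator(interpolates)
    grads = torch.autograd.grad(outputs=scores.sum(), inputs=interpolates,
                                create_graph=True)[0]
    return ((grads.flatten(1).norm(2, dim=1) - 1) ** 2).mean()

# d_loss = fake_scores.mean() - real_scores.mean() + 10.0 * gradient_penalty(...)
```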

46.HybridDNN: A Framework for High-Performance Hybrid DNN Accelerator Design and Implementation ⬇️

To speed up Deep Neural Network (DNN) accelerator design and enable effective implementation, we propose HybridDNN, a framework for building high-performance hybrid DNN accelerators and delivering FPGA-based hardware implementations. Novel techniques include a highly flexible and scalable architecture with a hybrid Spatial/Winograd convolution (CONV) Processing Engine (PE), a comprehensive design space exploration tool, and a complete design flow to fully support accelerator design and implementation. Experimental results show that the accelerators generated by HybridDNN can deliver 3375.7 and 83.3 GOPS on a high-end FPGA (VU9P) and an embedded FPGA (PYNQ-Z1), respectively, a 1.8x performance improvement compared to state-of-the-art accelerator designs. This demonstrates that HybridDNN is flexible and scalable and can target both cloud and embedded hardware platforms with vastly different resource constraints.

47.DashCam Pay: A System for In-vehicle Payments Using Face and Voice ⬇️

We present an open-loop system, called DashCam Pay, that enables in-vehicle payments using face and voice biometrics. The system uses a plug-and-play device (dashcam) mounted in the vehicle to capture face images and voice commands of passengers. The dashcam is connected to the mobile devices of passengers sitting in the vehicle, and uses privacy-preserving biometric comparison techniques to compare the biometric data captured by the dashcam with the biometric data enrolled on the users' mobile devices to determine the payer. Once the payer is verified, payment is initiated via the payer's mobile device. For an initial feasibility analysis, we collected data from 20 different subjects at two different sites using a commercially available dashcam, and evaluated open-source biometric algorithms on the collected data. Subsequently, we built an Android prototype of the proposed system using open-source software packages to demonstrate its utility in facilitating secure in-vehicle payments. DashCam Pay can be integrated by either dashcam or vehicle manufacturers to enable open-loop in-vehicle payments. We also discuss the applicability of the system to other payment scenarios, such as in-store payments.

48.Understanding Knowledge Gaps in Visual Question Answering: Implications for Gap Identification and Testing ⬇️

Visual Question Answering (VQA) systems are tasked with answering natural language questions about a presented image. Current VQA datasets typically contain questions related to the spatial information of objects, object attributes, or general scene questions. Recently, researchers have recognized the need to improve the balance of such datasets to reduce the system's dependency on memorized linguistic features and statistical biases and to allow for improved visual understanding. However, it is unclear whether there are any latent patterns that can be used to quantify and explain these failures. To better quantify our understanding of the performance of VQA models, we use a taxonomy of Knowledge Gaps (KGs) to identify and tag questions with one or more types of KGs. Each KG describes the reasoning abilities needed to arrive at a resolution, and failure to resolve a gap indicates an absence of the required reasoning ability. After identifying the KGs for each question, we examine the skew in the distribution of the number of questions per KG. To reduce this skew, we introduce a targeted question generation model, which allows us to generate new types of questions for an image.

49.COVID_MTNet: COVID-19 Detection with Multi-Task Deep Learning Approaches ⬇️

COVID-19 is currently one of the most life-threatening problems around the world. Fast and accurate detection of COVID-19 infection is essential to identify patients, make better decisions, and ensure treatment, which will help save their lives. In this paper, we propose a fast and efficient way to identify COVID-19 patients with multi-task deep learning (DL) methods. Both X-ray and CT scan images are considered to evaluate the proposed technique. We employ our Inception Residual Recurrent Convolutional Neural Network with Transfer Learning (TL) approach for COVID-19 detection and our NABLA-N network model for segmenting the regions infected by COVID-19. The detection model achieves around 84.67% testing accuracy on X-ray images and 98.78% accuracy on CT images. A novel quantitative analysis strategy is also proposed to determine the percentage of infected regions in X-ray and CT images. The qualitative and quantitative results demonstrate promising performance for COVID-19 detection and infected region localization.
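
A minimal sketch of the kind of quantitative analysis described above: given a binary lesion mask from the segmentation network and a lung mask, report the percentage of lung area that is infected. The mask inputs and function name are assumptions for illustration; the paper's exact strategy may normalize differently.

```python
import numpy as np

def infected_percentage(infection_mask, lung_mask):
    """Both masks: boolean arrays of the same shape (one X-ray or CT slice)."""
    lung_pixels = lung_mask.sum()
    if lung_pixels == 0:
        return 0.0
    infected = np.logical_and(infection_mask, lung_mask).sum()
    return 100.0 * infected / lung_pixels
```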

50.e-SNLI-VE-2.0: Corrected Visual-Textual Entailment with Natural Language Explanations ⬇️

The recently proposed SNLI-VE corpus for recognising visual-textual entailment is a large, real-world dataset for fine-grained multimodal reasoning. However, the automatic way in which SNLI-VE has been assembled (via combining parts of two related datasets) gives rise to a large number of errors in the labels of this corpus. In this paper, we first present a data collection effort to correct the class with the highest error rate in SNLI-VE. Second, we re-evaluate an existing model on the corrected corpus, which we call SNLI-VE-2.0, and provide a quantitative comparison with its performance on the non-corrected corpus. Third, we introduce e-SNLI-VE-2.0, which appends human-written natural language explanations to SNLI-VE-2.0. Finally, we train models that learn from these explanations at training time and output such explanations at testing time.

51.Channel Attention Residual U-Net for Retinal Vessel Segmentation ⬇️

Retinal vessel segmentation is a vital step in the early diagnosis of many eye-related diseases. In this work, we propose a new deep learning model, namely Channel Attention Residual U-Net (CAR-U-Net), to accurately segment retinal vascular and non-vascular pixels. In this model, the channel attention mechanism is introduced into the residual block, and a Channel Attention Residual Block (CARB) is proposed to enhance the discriminative ability of the network by considering the interdependence between feature channels. Moreover, to prevent the convolutional network from overfitting, a Structured Dropout Residual Block (SDRB) is proposed, consisting of a pre-activated residual block and DropBlock. The results show that our proposed CAR-U-Net reaches state-of-the-art performance on two publicly available retinal vessel datasets: DRIVE and CHASE_DB1.
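
For illustration, below is a sketch of a channel attention block in the squeeze-and-excitation style, the usual way the interdependence between feature channels is modeled; CARB wraps such a mechanism in a residual block, and its exact layer layout may differ from this assumption.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)  # squeeze: global spatial context
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),                    # per-channel attention weights
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w                         # recalibrate feature channels
```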

52.Coronavirus (COVID-19) Classification using Deep Features Fusion and Ranking Technique ⬇️

Coronavirus (COVID-19) emerged towards the end of 2019, and the World Health Organization (WHO) identified it as a global epidemic. There is a consensus that using Computerized Tomography (CT) techniques for early diagnosis of the pandemic disease gives both fast and accurate results. Expert radiologists have stated that COVID-19 displays distinctive behaviours in CT images. In this study, a novel method is proposed that fuses and ranks deep features to detect COVID-19 in its early phase. 16x16 (Subset-1) and 32x32 (Subset-2) patches were obtained from 150 CT images to generate the sub-datasets. Within the scope of the proposed method, 3000 patch images were labelled as "CoVID-19" or "No finding" for use in the training and testing phases. A feature fusion and ranking method is applied to increase performance, and the processed data are then classified with a Support Vector Machine (SVM). Compared with other pre-trained Convolutional Neural Network (CNN) models used in transfer learning, the proposed method shows high performance on Subset-2, with 98.27% accuracy, 98.93% sensitivity, 97.60% specificity, 97.63% precision, 98.28% F1-score and a 96.54% Matthews Correlation Coefficient (MCC).
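
A hedged sketch of the fuse-then-rank pipeline: deep features from two CNNs are concatenated, ranked by a univariate score, and the top-k features feed an SVM. SelectKBest with an ANOVA F-score is an assumed ranking criterion here; the paper's ranking method may differ.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

def build_classifier(feats_a, feats_b, labels, k=500):
    fused = np.concatenate([feats_a, feats_b], axis=1)     # feature fusion
    clf = make_pipeline(
        SelectKBest(f_classif, k=min(k, fused.shape[1])),  # feature ranking
        SVC(kernel="linear"))                              # final classifier
    clf.fit(fused, labels)
    return clf
```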

53.Dense Residual Network for Retinal Vessel Segmentation ⬇️

Retinal vessel segmentation plays an important role in retinal image analysis because changes in retinal vascular structure can aid in the diagnosis of diseases such as hypertension and diabetes. Recent research has produced numerous successful segmentation methods for fundus images, but for other retinal imaging modalities, more work is needed to explore vascular extraction. In this work, we propose an efficient method to segment blood vessels in Scanning Laser Ophthalmoscopy (SLO) retinal images. Inspired by U-Net, "feature map reuse" and residual learning, we propose a deep dense residual network structure called DRNet. In DRNet, feature maps from previous blocks are adaptively aggregated into subsequent layers as input, which not only facilitates spatial reconstruction but also learns more efficiently thanks to more stable gradients. Furthermore, we introduce DropBlock to alleviate the network's overfitting problem. We train and test this model on the recent SLO public dataset. The results show that our method achieves state-of-the-art performance even without data augmentation.
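
A minimal sketch of the "feature map reuse" idea, assuming DenseNet-style connectivity in which each block receives the concatenation of all previous blocks' outputs; the channel widths and 3x3 convolution body are illustrative, and DRNet's actual blocks also incorporate residual learning.

```python
import torch
import torch.nn as nn

class DenseStage(nn.Module):
    def __init__(self, in_ch, growth, n_blocks):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(in_ch + i * growth, growth, 3, padding=1),
                nn.BatchNorm2d(growth),
                nn.ReLU(inplace=True))
            for i in range(n_blocks))

    def forward(self, x):
        feats = [x]
        for block in self.blocks:
            # Each block sees every earlier feature map, aggregated by concat.
            feats.append(block(torch.cat(feats, dim=1)))
        return torch.cat(feats, dim=1)
```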

54.SA-UNet: Spatial Attention U-Net for Retinal Vessel Segmentation ⬇️

Precise segmentation of retinal blood vessels is of great significance for the early diagnosis of eye-related diseases such as diabetes and hypertension. In this work, we propose a lightweight network named Spatial Attention U-Net (SA-UNet) that does not require thousands of annotated training samples and can be used with data augmentation to make more efficient use of the available annotated samples. SA-UNet introduces a spatial attention module which infers an attention map along the spatial dimension and then multiplies the attention map by the input feature map for adaptive feature refinement. In addition, the proposed network employs a structured dropout convolutional block instead of the original convolutional block of U-Net to prevent the network from overfitting. We evaluate SA-UNet on two benchmark retinal datasets: the Digital Retinal Images for Vessel Extraction (DRIVE) dataset and the Child Heart and Health Study (CHASE_DB1) dataset. The results show that our proposed SA-UNet achieves state-of-the-art retinal vessel segmentation accuracy on both datasets.
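
Below is a sketch of a spatial attention module in the CBAM style: average- and max-pool across channels, infer a 2D attention map with a 7x7 convolution, and multiply it with the input for adaptive feature refinement. The 7x7 kernel is an assumption; SA-UNet's exact module may differ.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)       # channel-wise average pooling
        mx, _ = x.max(dim=1, keepdim=True)      # channel-wise max pooling
        attn = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * attn                         # refine features spatially
```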

55.Spatio-temporal Learning from Longitudinal Data for Multiple Sclerosis Lesion Segmentation ⬇️

Segmentation of Multiple Sclerosis (MS) lesions in longitudinal brain MR scans is performed to monitor the progression of MS lesions. To improve segmentation, we use spatio-temporal cues in longitudinal data. To that end, we propose two approaches: a longitudinal segmentation architecture grounded in the early fusion of longitudinal data, and, complementary to it, a novel multi-task learning approach that defines an auxiliary self-supervised task of deformable registration between two time-points to guide the neural network toward learning from spatio-temporal changes. We show the effectiveness of our methods on two datasets: an in-house dataset comprising 70 patients with one follow-up study per patient, and the ISBI longitudinal MS lesion segmentation challenge dataset, which has 19 patients with three to five follow-up studies each. Our results show that spatio-temporal information in longitudinal data is a beneficial cue for improving segmentation. Code is publicly available.
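
A minimal sketch of early fusion for longitudinal data, assuming the baseline and follow-up scans are concatenated along the channel axis before entering a single segmentation network, so the first convolution already sees both time-points; the one-layer network here is a placeholder, and the paper's architecture additionally adds the self-supervised registration task.

```python
import torch
import torch.nn as nn

def early_fusion(scan_t0, scan_t1, net):
    """scan_t0, scan_t1: tensors of shape (B, 1, D, H, W), two time-points."""
    fused = torch.cat([scan_t0, scan_t1], dim=1)  # (B, 2, D, H, W)
    return net(fused)

# Placeholder for the real segmentation model: 2 input channels, 1 lesion map.
net = nn.Conv3d(2, 1, kernel_size=3, padding=1)
out = early_fusion(torch.randn(1, 1, 8, 32, 32),
                   torch.randn(1, 1, 8, 32, 32), net)
```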

56.Query-controllable Video Summarization ⬇️

When video collections become huge, exploring both within and across videos efficiently is challenging. Video summarization is one way to tackle this issue. Traditional summarization approaches limit the effectiveness of video exploration because they generate only one fixed video summary for a given input video, independent of the user's information need. In this work, we introduce a method which takes a text-based query as input and generates a video summary corresponding to it. We do so by modeling video summarization as a supervised learning problem and propose an end-to-end deep learning based method for query-controllable video summarization that generates a query-dependent video summary. Our proposed method consists of a video summary controller, a video summary generator, and a video summary output module. To foster research on query-controllable video summarization and to conduct our experiments, we introduce a dataset that contains frame-based relevance score labels. Our experimental results show that the text-based query helps control the video summary and improves model performance. Our code and dataset: this https URL.
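
As a hedged sketch of the query-dependent selection step, one could score each frame by the similarity of its embedding to the query embedding and keep the top-k frames as the summary; the dot-product scorer and top-k selection are illustrative assumptions about how the controller and generator interact.

```python
import torch

def summarize(frame_embs, query_emb, k=5):
    """frame_embs: (T, D) frame embeddings; query_emb: (D,) query embedding."""
    scores = frame_embs @ query_emb                    # relevance per frame
    k = min(k, frame_embs.shape[0])
    top = torch.topk(scores, k).indices.sort().values  # keep temporal order
    return top, scores

idx, scores = summarize(torch.randn(100, 256), torch.randn(256))
```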

57.The relationship between Fully Connected Layers and number of classes for the analysis of retinal images ⬇️

This paper experiments with the number of fully-connected layers in a deep convolutional neural network applied to the classification of fundus retinal images. The images analysed correspond to ODIR 2019 (Peking University International Competition on Ocular Disease Intelligent Recognition) [9], which included images of various eye diseases (cataract, glaucoma, myopia, diabetic retinopathy, age-related macular degeneration (AMD), hypertension) as well as normal cases. This work focused on the classification of Normal, Cataract, AMD and Myopia. The feature extraction (convolutional) part of the neural network is kept the same while the feature mapping (linear) part of the network is changed. Different datasets, each differing in the number of classes it contains, are also explored on these networks. This paper hence aims to find the relationship between the number of classes and the number of fully-connected layers. It was found that the effect of increasing the number of fully-connected layers of a neural network depends on the type of dataset being used. For simple, linearly separable datasets, adding fully-connected layers is worth exploring and can result in better training accuracy, although a direct correlation was not found. However, as the complexity of the dataset increases (more overlapping classes), increasing the number of fully-connected layers causes the neural network to stop learning, and this happens sooner the more complex the dataset is.
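
To illustrate the experimental variable, the sketch below keeps a convolutional feature extractor fixed and varies only the number of fully-connected layers in the classifier head; the layer widths and the 512-dimensional feature input are illustrative assumptions, not the paper's configuration.

```python
import torch.nn as nn

def make_head(in_features, n_fc_layers, n_classes, width=256):
    """Classifier head with a configurable number of fully-connected layers."""
    layers, d = [], in_features
    for _ in range(n_fc_layers):
        layers += [nn.Linear(d, width), nn.ReLU(inplace=True)]
        d = width
    layers.append(nn.Linear(d, n_classes))  # final classification layer
    return nn.Sequential(*layers)

# e.g. compare heads with 1, 2 and 3 FC layers on the 4-class task
heads = [make_head(512, n, 4) for n in (1, 2, 3)]
```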