Skip to content

Latest commit

 

History

History
65 lines (65 loc) · 39.5 KB

20210603.md

File metadata and controls

65 lines (65 loc) · 39.5 KB

ArXiv cs.CV --Thu, 3 Jun 2021

1.The Semi-Supervised iNaturalist Challenge at the FGVC8 Workshop ⬇️

Semi-iNat is a challenging dataset for semi-supervised classification with a long-tailed distribution of classes, fine-grained categories, and domain shifts between labeled and unlabeled data. This dataset is behind the second iteration of the semi-supervised recognition challenge to be held at the FGVC8 workshop at CVPR 2021. Different from the previous one, this dataset (i) includes images of species from different kingdoms in the natural taxonomy, (ii) is at a larger scale --- with 810 in-class and 1629 out-of-class species for a total of 330k images, and (iii) does not provide in/out-of-class labels, but provides coarse taxonomic labels (kingdom and phylum) for the unlabeled images. This document describes baseline results and the details of the dataset which is available here: \url{this https URL}.

2.Data augmentation and pre-trained networks for extremely low data regimes unsupervised visual inspection ⬇️

The use of deep features coming from pre-trained neural networks for unsupervised anomaly detection purposes has recently gathered momentum in the computer vision field. In particular, industrial inspection applications can take advantage of such features, as demonstrated by the multiple successes of related methods on the MVTec Anomaly Detection (MVTec AD) dataset. These methods make use of neural networks pre-trained on auxiliary classification tasks such as ImageNet. However, to our knowledge, no comparative study of robustness to the low data regimes between these approaches has been conducted yet. For quality inspection applications, the handling of limited sample sizes may be crucial as large quantities of images are not available for small series. In this work, we aim to compare three approaches based on deep pre-trained features when varying the quantity of available data in MVTec AD: KNN, Mahalanobis, and PaDiM. We show that although these methods are mostly robust to small sample sizes, they still can benefit greatly from using data augmentation in the original image space, which allows to deal with very small production runs.

3.Semi-Supervised Semantic Segmentation with Cross Pseudo Supervision ⬇️

In this paper, we study the semi-supervised semantic segmentation problem via exploring both labeled data and extra unlabeled data. We propose a novel consistency regularization approach, called cross pseudo supervision (CPS). Our approach imposes the consistency on two segmentation networks perturbed with different initialization for the same input image. The pseudo one-hot label map, output from one perturbed segmentation network, is used to supervise the other segmentation network with the standard cross-entropy loss, and vice versa. The CPS consistency has two roles: encourage high similarity between the predictions of two perturbed networks for the same input image, and expand training data by using the unlabeled data with pseudo labels. Experiment results show that our approach achieves the state-of-the-art semi-supervised segmentation performance on Cityscapes and PASCAL VOC 2012.

4.DFGC 2021: A DeepFake Game Competition ⬇️

This paper presents a summary of the DFGC 2021 competition. DeepFake technology is developing fast, and realistic face-swaps are increasingly deceiving and hard to detect. At the same time, DeepFake detection methods are also improving. There is a two-party game between DeepFake creators and detectors. This competition provides a common platform for benchmarking the adversarial game between current state-of-the-art DeepFake creation and detection methods. In this paper, we present the organization, results and top solutions of this competition and also share our insights obtained during this event. We also release the DFGC-21 testing dataset collected from our participants to further benefit the research community.

5.ImVoxelNet: Image to Voxels Projection for Monocular and Multi-View General-Purpose 3D Object Detection ⬇️

In this paper, we introduce the task of multi-view RGB-based 3D object detection as an end-to-end optimization problem. To address this problem, we propose ImVoxelNet, a novel fully convolutional method of 3D object detection based on monocular or multi-view RGB images. The number of monocular images in each multi-view input can variate during training and inference; actually, this number might be unique for each multi-view input. ImVoxelNet successfully handles both indoor and outdoor scenes, which makes it general-purpose. Specifically, it achieves state-of-the-art results in car detection on KITTI (monocular) and nuScenes (multi-view) benchmarks among all methods that accept RGB images. Moreover, it surpasses existing RGB-based 3D object detection methods on the SUN RGB-D dataset. On ScanNet, ImVoxelNet sets a new benchmark for multi-view 3D object detection. The source code and the trained models are available at \url{this https URL}.

6.Online and Real-Time Tracking in a Surveillance Scenario ⬇️

This paper presents an approach for tracking in a surveillance scenario. Typical aspects for this scenario are a 24/7 operation with a static camera mounted above the height of a human with many objects or people. The Multiple Object Tracking Benchmark 20 (MOT20) reflects this scenario best. We can show that our approach is real-time capable on this benchmark and outperforms all other real-time capable approaches in HOTA, MOTA, and IDF1. We achieve this by contributing a fast Siamese network reformulated for linear runtime (instead of quadratic) to generate fingerprints from detections. Thus, it is possible to associate the detections to Kalman filters based on multiple tracking specific ratings: Cosine similarity of fingerprints, Intersection over Union, and pixel distance ratio in the image.

7.Benchmarking CNN on 3D Anatomical Brain MRI: Architectures, Data Augmentation and Deep Ensemble Learning ⬇️

Deep Learning (DL) and specifically CNN models have become a de facto method for a wide range of vision tasks, outperforming traditional machine learning (ML) methods. Consequently, they drew a lot of attention in the neuroimaging field in particular for phenotype prediction or computer-aided diagnosis. However, most of the current studies often deal with small single-site cohorts, along with a specific pre-processing pipeline and custom CNN architectures, which make them difficult to compare to. We propose an extensive benchmark of recent state-of-the-art (SOTA) 3D CNN, evaluating also the benefits of data augmentation and deep ensemble learning, on both Voxel-Based Morphometry (VBM) pre-processing and quasi-raw images. Experiments were conducted on a large multi-site 3D brain anatomical MRI data-set comprising N=10k scans on 3 challenging tasks: age prediction, sex classification, and schizophrenia diagnosis. We found that all models provide significantly better predictions with VBM images than quasi-raw data. This finding evolved as the training set approaches 10k samples where quasi-raw data almost reach the performance of VBM. Moreover, we showed that linear models perform comparably with SOTA CNN on VBM data. We also demonstrated that DenseNet and tiny-DenseNet, a lighter version that we proposed, provide a good compromise in terms of performance in all data regime. Therefore, we suggest to employ them as the architectures by default. Critically, we also showed that current CNN are still very biased towards the acquisition site, even when trained with N=10k multi-site images. In this context, VBM pre-processing provides an efficient way to limit this site effect. Surprisingly, we did not find any clear benefit from data augmentation techniques. Finally, we proved that deep ensemble learning is well suited to re-calibrate big CNN models without sacrificing performance.

8.Towards Robust Classification Model by Counterfactual and Invariant Data Generation ⬇️

Despite the success of machine learning applications in science, industry, and society in general, many approaches are known to be non-robust, often relying on spurious correlations to make predictions. Spuriousness occurs when some features correlate with labels but are not causal; relying on such features prevents models from generalizing to unseen environments where such correlations break. In this work, we focus on image classification and propose two data generation processes to reduce spuriousness. Given human annotations of the subset of the features responsible (causal) for the labels (e.g. bounding boxes), we modify this causal set to generate a surrogate image that no longer has the same label (i.e. a counterfactual image). We also alter non-causal features to generate images still recognized as the original labels, which helps to learn a model invariant to these features. In several challenging datasets, our data generations outperform state-of-the-art methods in accuracy when spurious correlations break, and increase the saliency focus on causal features providing better explanations.

9.TSI: Temporal Saliency Integration for Video Action Recognition ⬇️

Efficient spatiotemporal modeling is an important yet challenging problem for video action recognition. Existing state-of-the-art methods exploit motion clues to assist in short-term temporal modeling through temporal difference over consecutive frames. However, background noises will be inevitably introduced due to the camera movement. Besides, movements of different actions can vary greatly. In this paper, we propose a Temporal Saliency Integration (TSI) block, which mainly contains a Salient Motion Excitation (SME) module and a Cross-scale Temporal Integration (CTI) module. Specifically, SME aims to highlight the motion-sensitive area through local-global motion modeling, where the background suppression and pyramidal feature difference are conducted successively between neighboring frames to capture motion dynamics with less background noises. CTI is designed to perform multi-scale temporal modeling through a group of separate 1D convolutions respectively. Meanwhile, temporal interactions across different scales are integrated with attention mechanism. Through these two modules, long short-term temporal relationships can be encoded efficiently by introducing limited additional parameters. Extensive experiments are conducted on several popular benchmarks (i.e., Something-Something v1 & v2, Kinetics-400, UCF-101, and HMDB-51), which demonstrate the effectiveness and superiority of our proposed method.

10.Rethinking Cross-modal Interaction from a Top-down Perspective for Referring Video Object Segmentation ⬇️

Referring video object segmentation (RVOS) aims to segment video objects with the guidance of natural language reference. Previous methods typically tackle RVOS through directly grounding linguistic reference over the image lattice. Such bottom-up strategy fails to explore object-level cues, easily leading to inferior results. In this work, we instead put forward a two-stage, top-down RVOS solution. First, an exhaustive set of object tracklets is constructed by propagating object masks detected from several sampled frames to the entire video. Second, a Transformer-based tracklet-language grounding module is proposed, which models instance-level visual relations and cross-modal interactions simultaneously and efficiently. Our model ranks first place on CVPR2021 Referring Youtube-VOS challenge.

11.A Novel Edge Detection Operator for Identifying Buildings in Augmented Reality Applications ⬇️

Augmented Reality is an environment-enhancing technology, widely applied in many domains, such as tourism and culture. One of the major challenges in this field is precise detection and extraction of building information through Computer Vision techniques. Edge detection is one of the building blocks operations for many feature extraction solutions in Computer Vision. AR systems use edge detection for building extraction or for extraction of facade details from buildings. In this paper, we propose a novel filter operator for edge detection that aims to extract building contours or facade features better. The proposed filter gives more weight for finding vertical and horizontal edges that is an important feature for our aim.

12.Towards Unified Surgical Skill Assessment ⬇️

Surgical skills have a great influence on surgical safety and patients' well-being. Traditional assessment of surgical skills involves strenuous manual efforts, which lacks efficiency and repeatability. Therefore, we attempt to automatically predict how well the surgery is performed using the surgical video. In this paper, a unified multi-path framework for automatic surgical skill assessment is proposed, which takes care of multiple composing aspects of surgical skills, including surgical tool usage, intraoperative event pattern, and other skill proxies. The dependency relationships among these different aspects are specially modeled by a path dependency module in the framework. We conduct extensive experiments on the JIGSAWS dataset of simulated surgical tasks, and a new clinical dataset of real laparoscopic surgeries. The proposed framework achieves promising results on both datasets, with the state-of-the-art on the simulated dataset advanced from 0.71 Spearman's correlation to 0.80. It is also shown that combining multiple skill aspects yields better performance than relying on a single aspect.

13.Feedback Network for Mutually Boosted Stereo Image Super-Resolution and Disparity Estimation ⬇️

Under stereo settings, the problem of image super-resolution (SR) and disparity estimation are interrelated that the result of each problem could help to solve the other. The effective exploitation of correspondence between different views facilitates the SR performance, while the high-resolution (HR) features with richer details benefit the correspondence estimation. According to this motivation, we propose a Stereo Super-Resolution and Disparity Estimation Feedback Network (SSRDE-FNet), which simultaneously handles the stereo image super-resolution and disparity estimation in a unified framework and interact them with each other to further improve their performance. Specifically, the SSRDE-FNet is composed of two dual recursive sub-networks for left and right views. Besides the cross-view information exploitation in the low-resolution (LR) space, HR representations produced by the SR process are utilized to perform HR disparity estimation with higher accuracy, through which the HR features can be aggregated to generate a finer SR result. Afterward, the proposed HR Disparity Information Feedback (HRDIF) mechanism delivers information carried by HR disparity back to previous layers to further refine the SR image reconstruction. Extensive experiments demonstrate the effectiveness and advancement of SSRDE-FNet.

14.End-to-End Information Extraction by Character-Level Embedding and Multi-Stage Attentional U-Net ⬇️

Information extraction from document images has received a lot of attention recently, due to the need for digitizing a large volume of unstructured documents such as invoices, receipts, bank transfers, etc. In this paper, we propose a novel deep learning architecture for end-to-end information extraction on the 2D character-grid embedding of the document, namely the \textit{Multi-Stage Attentional U-Net}. To effectively capture the textual and spatial relations between 2D elements, our model leverages a specialized multi-stage encoder-decoders design, in conjunction with efficient uses of the self-attention mechanism and the box convolution. Experimental results on different datasets show that our model outperforms the baseline U-Net architecture by a large margin while using 40% fewer parameters. Moreover, it also significantly improved the baseline in erroneous OCR and limited training data scenario, thus becomes practical for real-world applications.

15.Consumer Image Quality Prediction using Recurrent Neural Networks for Spatial Pooling ⬇️

Promising results for subjective image quality prediction have been achieved during the past few years by using convolutional neural networks (CNN). However, the use of CNNs for high resolution image quality assessment remains a challenge, since typical CNN architectures have been designed for small resolution input images. In this study, we propose an image quality model that attempts to mimic the attention mechanism of human visual system (HVS) by using a recurrent neural network (RNN) for spatial pooling of the features extracted from different spatial areas (patches) by a deep CNN-based feature extractor. The experimental study, conducted by using images with different resolutions from two recently published image quality datasets, indicates that the quality prediction accuracy of the proposed method is competitive against benchmark models representing the state-of-the-art, and the proposed method also performs consistently on different resolution versions of the same dataset.

16.Translational Symmetry-Aware Facade Parsing for 3D Building Reconstruction ⬇️

Effectively parsing the facade is essential to 3D building reconstruction, which is an important computer vision problem with a large amount of applications in high precision map for navigation, computer aided design, and city generation for digital entertainments. To this end, the key is how to obtain the shape grammars from 2D images accurately and efficiently. Although enjoying the merits of promising results on the semantic parsing, deep learning methods cannot directly make use of the architectural rules, which play an important role for man-made structures. In this paper, we present a novel translational symmetry-based approach to improving the deep neural networks. Our method employs deep learning models as the base parser, and a module taking advantage of translational symmetry is used to refine the initial parsing results. In contrast to conventional semantic segmentation or bounding box prediction, we propose a novel scheme to fuse segmentation with anchor-free detection in a single stage network, which enables the efficient training and better convergence. After parsing the facades into shape grammars, we employ an off-the-shelf rendering engine like Blender to reconstruct the realistic high-quality 3D models using procedural modeling. We conduct experiments on three public datasets, where our proposed approach outperforms the state-of-the-art methods. In addition, we have illustrated the 3D building models built from 2D facade images.

17.TransMIL: Transformer based Correlated Multiple Instance Learning for Whole Slide Image Classication ⬇️

Multiple instance learning (MIL) is a powerful tool to solve the weakly supervised classification in whole slide image (WSI) based pathology diagnosis. However, the current MIL methods are usually based on independent and identical distribution hypothesis, thus neglect the correlation among different instances. To address this problem, we proposed a new framework, called correlated MIL, and provided a proof for convergence. Based on this framework, we devised a Transformer based MIL (TransMIL), which explored both morphological and spatial information. The proposed TransMIL can effectively deal with unbalanced/balanced and binary/multiple classification with great visualization and interpretability. We conducted various experiments for three different computational pathology problems and achieved better performance and faster convergence compared with state-of-the-art methods. The test AUC for the binary tumor classification can be up to 93.09% over CAMELYON16 dataset. And the AUC over the cancer subtypes classification can be up to 96.03% and 98.82% over TCGA-NSCLC dataset and TCGA-RCC dataset, respectively.

18.Rotation Equivariant Feature Image Pyramid Network for Object Detection in Optical Remote Sensing Imagery ⬇️

Over the last few years, there has been substantial progress in object detection on remote sensing images (RSIs) where objects are generally distributed with large-scale variations and have different types of orientations. Nevertheless, most of the current convolution neural network approaches lack the ability to deal with the challenges such as size and rotation variations. To address these problems, we propose the rotation equivariant feature image pyramid network (REFIPN), an image pyramid network based on rotation equivariance convolution. The proposed pyramid network extracts features in a wide range of scales and orientations by using novel convolution filters. These features are used to generate vector fields and determine the weight and angle of the highest-scoring orientation for all spatial locations on an image. Finally, the extracted features go through the prediction layers of the detector. The detection performance of the proposed model is validated on two commonly used aerial benchmarks and the results show our propose model can achieve state-of-the-art performance with satisfactory efficiency.

19.Refining the bounding volumes for lossless compression of voxelized point clouds geometry ⬇️

This paper describes a novel lossless compression method for point cloud geometry, building on a recent lossy compression method that aimed at reconstructing only the bounding volume of a point cloud. The proposed scheme starts by partially reconstructing the geometry from the two depthmaps associated to a single projection direction. The partial reconstruction obtained from the depthmaps is completed to a full reconstruction of the point cloud by sweeping section by section along one direction and encoding the points which were not contained in the two depthmaps. The main ingredient is a list-based encoding of the inner points (situated inside the feasible regions) by a novel arithmetic three dimensional context coding procedure that efficiently utilizes rotational invariances present in the input data. State-of-the-art bits-per-voxel results are obtained on benchmark datasets.

20.nnDetection: A Self-configuring Method for Medical Object Detection ⬇️

Simultaneous localisation and categorization of objects in medical images, also referred to as medical object detection, is of high clinical relevance because diagnostic decisions often depend on rating of objects rather than e.g. pixels. For this task, the cumbersome and iterative process of method configuration constitutes a major research bottleneck. Recently, nnU-Net has tackled this challenge for the task of image segmentation with great success. Following nnU-Net's agenda, in this work we systematize and automate the configuration process for medical object detection. The resulting self-configuring method, nnDetection, adapts itself without any manual intervention to arbitrary medical detection problems while achieving results en par with or superior to the state-of-the-art. We demonstrate the effectiveness of nnDetection on two public benchmarks, ADAM and LUNA16, and propose 10 further medical object detection tasks on public data sets for comprehensive method evaluation. Code is at this https URL .

21.Cleaning and Structuring the Label Space of the iMet Collection 2020 ⬇️

The iMet 2020 dataset is a valuable resource in the space of fine-grained art attribution recognition, but we believe it has yet to reach its true potential. We document the unique properties of the dataset and observe that many of the attribute labels are noisy, more than is implied by the dataset description. Oftentimes, there are also semantic relationships between the labels (e.g., identical, mutual exclusion, subsumption, overlap with uncertainty) which we believe are underutilized. We propose an approach to cleaning and structuring the iMet 2020 labels, and discuss the implications and value of doing so. Further, we demonstrate the benefits of our proposed approach through several experiments. Our code and cleaned labels are available at this https URL.

22.Multi-task fully convolutional network for tree species mapping in dense forests using small training hyperspectral data ⬇️

This work proposes a multi-task fully convolutional architecture for tree species mapping in dense forests from sparse and scarce polygon-level annotations using hyperspectral UAV-borne data. Our model implements a partial loss function that enables dense tree semantic labeling outcomes from non-dense training samples, and a distance regression complementary task that enforces tree crown boundary constraints and substantially improves the model performance. Our multi-task architecture uses a shared backbone network that learns common representations for both tasks and two task-specific decoders, one for the semantic segmentation output and one for the distance map regression. We report that introducing the complementary task boosts the semantic segmentation performance compared to the single-task counterpart in up to 10% reaching an overall F1 score of 87.5% and an overall accuracy of 85.9%, achieving state-of-art performance for tree species classification in tropical forests.

23.ICDAR 2021 Competition on On-Line Signature Verification ⬇️

This paper describes the experimental framework and results of the ICDAR 2021 Competition on On-Line Signature Verification (SVC 2021). The goal of SVC 2021 is to evaluate the limits of on-line signature verification systems on popular scenarios (office/mobile) and writing inputs (stylus/finger) through large-scale public databases. Three different tasks are considered in the competition, simulating realistic scenarios as both random and skilled forgeries are simultaneously considered on each task. The results obtained in SVC 2021 prove the high potential of deep learning methods. In particular, the best on-line signature verification system of SVC 2021 obtained Equal Error Rate (EER) values of 3.33% (Task 1), 7.41% (Task 2), and 6.04% (Task 3).
SVC 2021 will be established as an on-going competition, where researchers can easily benchmark their systems against the state of the art in an open common platform using large-scale public databases such as DeepSignDB and SVC2021_EvalDB, and standard experimental protocols.

24.Deep Clustering Activation Maps for Emphysema Subtyping ⬇️

We propose a deep learning clustering method that exploits dense features from a segmentation network for emphysema subtyping from computed tomography (CT) scans. Using dense features enables high-resolution visualization of image regions corresponding to the cluster assignment via dense clustering activation maps (dCAMs). This approach provides model interpretability. We evaluated clustering results on 500 subjects from the COPDGenestudy, where radiologists manually annotated emphysema sub-types according to their visual CT assessment. We achieved a 43% unsupervised clustering accuracy, outperforming our baseline at 41% and yielding results comparable to supervised classification at 45%. The proposed method also offers a better cluster formation than the baseline, achieving0.54 in silhouette coefficient and 0.55 in David-Bouldin scores.

25.Digital homotopy relations and digital homology theories ⬇️

In this paper we prove results relating to two homotopy relations and four homology theories developed in the topology of digital images.
We introduce a new type of homotopy relation for digitally continuous functions which we call "strong homotopy." Both digital homotopy and strong homotopy are natural digitizations of classical topological homotopy: the difference between them is analogous to the difference between digital 4-adjacency and 8-adjacency in the plane.
We also consider four different digital homology theories: a simplicial homology theory by Arslan et al which is the homology of the clique complex, a singular simplicial homology theory by D. W. Lee, a cubical homology theory by Jamil and Ali, and a new kind of cubical homology for digital images with $c_1$-adjacency which is easily computed, and generalizes a construction by Karaca & Ege. We show that the two simplicial homology theories are isomorphic to each other, but distinct from the two cubical theories.
We also show that homotopic maps have the same induced homomorphisms in the cubical homology theory, and strong homotopic maps additionally have the same induced homomorphisms in the simplicial theory.

26.Deep Learning based Full-reference and No-reference Quality Assessment Models for Compressed UGC Videos ⬇️

In this paper, we propose a deep learning based video quality assessment (VQA) framework to evaluate the quality of the compressed user's generated content (UGC) videos. The proposed VQA framework consists of three modules, the feature extraction module, the quality regression module, and the quality pooling module. For the feature extraction module, we fuse the features from intermediate layers of the convolutional neural network (CNN) network into final quality-aware feature representation, which enables the model to make full use of visual information from low-level to high-level. Specifically, the structure and texture similarities of feature maps extracted from all intermediate layers are calculated as the feature representation for the full reference (FR) VQA model, and the global mean and standard deviation of the final feature maps fused by intermediate feature maps are calculated as the feature representation for the no reference (NR) VQA model. For the quality regression module, we use the fully connected (FC) layer to regress the quality-aware features into frame-level scores. Finally, a subjectively-inspired temporal pooling strategy is adopted to pool frame-level scores into the video-level score. The proposed model achieves the best performance among the state-of-the-art FR and NR VQA models on the Compressed UGC VQA database and also achieves pretty good performance on the in-the-wild UGC VQA databases.

27.Prediction of the Position of External Markers Using a Recurrent Neural Network Trained With Unbiased Online Recurrent Optimization for Safe Lung Cancer Radiotherapy ⬇️

During lung cancer radiotherapy, the position of infrared reflective objects on the chest can be recorded to estimate the tumor location. However, radiotherapy systems usually have a latency inherent to robot control limitations that impedes the radiation delivery precision. Not taking this phenomenon into account may cause unwanted damage to healthy tissues and lead to side effects such as radiation pneumonitis. In this research, we use nine observation records of the three-dimensional position of three external markers on the chest and abdomen of healthy individuals breathing during intervals from 73s to 222s. The sampling frequency is equal to 10Hz and the amplitudes of the recorded trajectories range from 6mm to 40mm in the superior-inferior direction. We forecast the location of each marker simultaneously with a horizon value (the time interval in advance for which the prediction is made) between 0.1s and 2.0s, using a recurrent neural network (RNN) trained with unbiased online recurrent optimization (UORO). We compare its performance with an RNN trained with real-time recurrent learning, least mean squares (LMS), and offline linear regression. Training and cross-validation are performed during the first minute of each sequence. On average, UORO achieves the lowest root-mean-square (RMS) and maximum error, equal respectively to 1.3mm and 8.8mm, with a prediction time per time step lower than 2.8ms (Dell Intel core i9-9900K 3.60Ghz). Linear regression has the lowest RMS error for the horizon values 0.1s and 0.2s, followed by LMS for horizon values between 0.3s and 0.5s, and UORO for horizon values greater than 0.6s.

28.Online Coreset Selection for Rehearsal-based Continual Learning ⬇️

A dataset is a shred of crucial evidence to describe a task. However, each data point in the dataset does not have the same potential, as some of the data points can be more representative or informative than others. This unequal importance among the data points may have a large impact in rehearsal-based continual learning, where we store a subset of the training examples (coreset) to be replayed later to alleviate catastrophic forgetting. In continual learning, the quality of the samples stored in the coreset directly affects the model's effectiveness and efficiency. The coreset selection problem becomes even more important under realistic settings, such as imbalanced continual learning or noisy data scenarios. To tackle this problem, we propose Online Coreset Selection (OCS), a simple yet effective method that selects the most representative and informative coreset at each iteration and trains them in an online manner. Our proposed method maximizes the model's adaptation to a target dataset while selecting high-affinity samples to past tasks, which directly inhibits catastrophic forgetting. We validate the effectiveness of our coreset selection mechanism over various standard, imbalanced, and noisy datasets against strong continual learning baselines, demonstrating that it improves task adaptation and prevents catastrophic forgetting in a sample-efficient manner.

29.Tips and Tricks to Improve CNN-based Chest X-ray Diagnosis: A Survey ⬇️

Convolutional Neural Networks (CNNs) intrinsically requires large-scale data whereas Chest X-Ray (CXR) images tend to be data/annotation-scarce, leading to over-fitting. Therefore, based on our development experience and related work, this paper thoroughly introduces tricks to improve generalization in the CXR diagnosis: how to (i) leverage additional data, (ii) augment/distillate data, (iii) regularize training, and (iv) conduct efficient segmentation. As a development example based on such optimization techniques, we also feature LPIXEL's CNN-based CXR solution, EIRL Chest Nodule, which improved radiologists/non-radiologists' nodule detection sensitivity by 0.100/0.131, respectively, while maintaining specificity.

30.Self-supervised Lesion Change Detection and Localisation in Longitudinal Multiple Sclerosis Brain Imaging ⬇️

Longitudinal imaging forms an essential component in the management and follow-up of many medical conditions. The presence of lesion changes on serial imaging can have significant impact on clinical decision making, highlighting the important role for automated change detection. Lesion changes can represent anomalies in serial imaging, which implies a limited availability of annotations and a wide variety of possible changes that need to be considered. Hence, we introduce a new unsupervised anomaly detection and localisation method trained exclusively with serial images that do not contain any lesion changes. Our training automatically synthesises lesion changes in serial images, introducing detection and localisation pseudo-labels that are used to self-supervise the training of our model. Given the rarity of these lesion changes in the synthesised images, we train the model with the imbalance robust focal Tversky loss. When compared to supervised models trained on different datasets, our method shows competitive performance in the detection and localisation of new demyelinating lesions on longitudinal magnetic resonance imaging in multiple sclerosis patients. Code for the models will be made available on GitHub.

31.Fourier Space Losses for Efficient Perceptual Image Super-Resolution ⬇️

Many super-resolution (SR) models are optimized for high performance only and therefore lack efficiency due to large model complexity. As large models are often not practical in real-world applications, we investigate and propose novel loss functions, to enable SR with high perceptual quality from much more efficient models. The representative power for a given low-complexity generator network can only be fully leveraged by strong guidance towards the optimal set of parameters. We show that it is possible to improve the performance of a recently introduced efficient generator architecture solely with the application of our proposed loss functions. In particular, we use a Fourier space supervision loss for improved restoration of missing high-frequency (HF) content from the ground truth image and design a discriminator architecture working directly in the Fourier domain to better match the target HF distribution. We show that our losses' direct emphasis on the frequencies in Fourier-space significantly boosts the perceptual image quality, while at the same time retaining high restoration quality in comparison to previously proposed loss functions for this task. The performance is further improved by utilizing a combination of spatial and frequency domain losses, as both representations provide complementary information during training. On top of that, the trained generator achieves comparable results with and is 2.4x and 48x faster than state-of-the-art perceptual SR methods RankSRGAN and SRFlow respectively.

32.Evaluating Recipes Generated from Functional Object-Oriented Network ⬇️

The functional object-oriented network (FOON) has been introduced as a knowledge representation, which takes the form of a graph, for symbolic task planning. To get a sequential plan for a manipulation task, a robot can obtain a task tree through a knowledge retrieval process from the FOON. To evaluate the quality of an acquired task tree, we compare it with a conventional form of task knowledge, such as recipes or manuals. We first automatically convert task trees to recipes, and we then compare them with the human-created recipes in the Recipe1M+ dataset via a survey. Our preliminary study finds no significant difference between the recipes in Recipe1M+ and the recipes generated from FOON task trees in terms of correctness, completeness, and clarity.