Skip to content

Latest commit

 

History

History
85 lines (85 loc) · 57.4 KB

20210528.md

File metadata and controls

85 lines (85 loc) · 57.4 KB

ArXiv cs.CV --Fri, 28 May 2021

1.Unsupervised Activity Segmentation by Joint Representation Learning and Online Clustering ⬇️

We present a novel approach for unsupervised activity segmentation, which uses video frame clustering as a pretext task and simultaneously performs representation learning and online clustering. This is in contrast with prior works where representation learning and online clustering are often performed sequentially. We leverage temporal information in videos by employing temporal optimal transport and temporal coherence loss. In particular, we incorporate a temporal regularization term into the standard optimal transport module, which preserves the temporal order of the activity, yielding the temporal optimal transport module for computing pseudo-label cluster assignments. Next, the temporal coherence loss encourages neighboring video frames to be mapped to nearby points while distant video frames are mapped to farther away points in the embedding space. The combination of these two components results in effective representations for unsupervised activity segmentation. Furthermore, previous methods require storing learned features for the entire dataset before clustering them in an offline manner, whereas our approach processes one mini-batch at a time in an online manner. Extensive evaluations on three public datasets, i.e. 50-Salads, YouTube Instructions, and Breakfast, and our dataset, i.e., Desktop Assembly, show that our approach performs on par or better than previous methods for unsupervised activity segmentation, despite having significantly less memory constraints.

2.Tracking Without Re-recognition in Humans and Machines ⬇️

Imagine trying to track one particular fruitfly in a swarm of hundreds. Higher biological visual systems have evolved to track moving objects by relying on both appearance and motion features. We investigate if state-of-the-art deep neural networks for visual tracking are capable of the same. For this, we introduce PathTracker, a synthetic visual challenge that asks human observers and machines to track a target object in the midst of identical-looking "distractor" objects. While humans effortlessly learn PathTracker and generalize to systematic variations in task design, state-of-the-art deep networks struggle. To address this limitation, we identify and model circuit mechanisms in biological brains that are implicated in tracking objects based on motion cues. When instantiated as a recurrent network, our circuit model learns to solve PathTracker with a robust visual strategy that rivals human performance and explains a significant proportion of their decision-making on the challenge. We also show that the success of this circuit model extends to object tracking in natural videos. Adding it to a transformer-based architecture for object tracking builds tolerance to visual nuisances that affect object appearance, resulting in a new state-of-the-art performance on the large-scale TrackingNet object tracking challenge. Our work highlights the importance of building artificial vision models that can help us better understand human vision and improve computer vision.

3.CogView: Mastering Text-to-Image Generation via Transformers ⬇️

Text-to-Image generation in the general domain has long been an open problem, which requires both generative model and cross-modal understanding. We propose CogView, a 4-billion-parameter Transformer with VQ-VAE tokenizer to advance this problem. We also demonstrate the finetuning strategies for various downstream tasks, e.g. style learning, super-resolution, text-image ranking and fashion design, and methods to stabilize pretraining, e.g. eliminating NaN losses. CogView (zero-shot) achieves a new state-of-the-art FID on blurred MS COCO, outperforms previous GAN-based models and a recent similar work DALL-E.

4.Dynamic Network selection for the Object Detection task: why it matters and what we (didn't) achieve ⬇️

In this paper, we want to show the potential benefit of a dynamic auto-tuning approach for the inference process in the Deep Neural Network (DNN) context, tackling the object detection challenge. We benchmarked different neural networks to find the optimal detector for the well-known COCO 17 database, and we demonstrate that even if we only consider the quality of the prediction there is not a single optimal network. This is even more evident if we also consider the time to solution as a metric to evaluate, and then select, the most suitable network. This opens to the possibility for an adaptive methodology to switch among different object detection networks according to run-time requirements (e.g. maximum quality subject to a time-to-solution constraint).
Moreover, we demonstrated by developing an ad hoc oracle, that an additional proactive methodology could provide even greater benefits, allowing us to select the best network among the available ones given some characteristics of the processed image. To exploit this method, we need to identify some image features that can be used to steer the decision on the most promising network. Despite the optimization opportunity that has been identified, we were not able to identify a predictor function that validates this attempt neither adopting classical image features nor by using a DNN classifier.

5.ICDAR 2021 Competition on Historical Map Segmentation ⬇️

This paper presents the final results of the ICDAR 2021 Competition on Historical Map Segmentation (MapSeg), encouraging research on a series of historical atlases of Paris, France, drawn at 1/5000 scale between 1894 and 1937. The competition featured three tasks, awarded separately. Task1 consists in detecting building blocks and was won by the L3IRIS team using a DenseNet-121 network trained in a weakly supervised fashion. This task is evaluated on 3 large images containing hundreds of shapes to detect. Task2 consists in segmenting map content from the larger map sheet, and was won by the UWB team using a U-Net-like FCN combined with a binarization method to increase detection edge accuracy. Task3 consists in locating intersection points of geo-referencing lines, and was also won by the UWB team who used a dedicated pipeline combining binarization, line detection with Hough transform, candidate filtering, and template matching for intersection refinement. Tasks2 and~3 are evaluated on 95 map sheets with complex content. Dataset, evaluation tools and results are available under permissive licensing at \url{this https URL}.

6.How saccadic vision might help with theinterpretability of deep networks ⬇️

We describe how some problems (interpretability,lack of object-orientedness) of modern deep networks potentiallycould be solved by adapting a biologically plausible saccadicmechanism of perception. A sketch of such a saccadic visionmodel is proposed. Proof of concept experimental results areprovided to support the proposed approach.

7.Using Early-Learning Regularization to Classify Real-World Noisy Data ⬇️

The memorization problem is well-known in the field of computer vision. Liu et al. propose a technique called Early-Learning Regularization, which improves accuracy on the CIFAR datasets when label noise is present. This project replicates their experiments and investigates the performance on a real-world dataset with intrinsic noise. Results show that their experimental results are consistent. We also explore Sharpness-Aware Minimization in addition to SGD and observed a further 14.6 percentage points improvement. Future work includes using all 6 million images and manually clean a fraction of the images to fine-tune a transfer learning model. Last but not the least, having access to clean data for testing would also improve the measurement of accuracy.

8.A Dataset for Provident Vehicle Detection at Night ⬇️

In current object detection, algorithms require the object to be directly visible in order to be detected. As humans, however, we intuitively use visual cues caused by the respective object to already make assumptions about its appearance. In the context of driving, such cues can be shadows during the day and often light reflections at night. In this paper, we study the problem of how to map this intuitive human behavior to computer vision algorithms to detect oncoming vehicles at night just from the light reflections they cause by their headlights. For that, we present an extensive open-source dataset containing 59746 annotated grayscale images out of 346 different scenes in a rural environment at night. In these images, all oncoming vehicles, their corresponding light objects (e.g., headlamps), and their respective light reflections (e.g., light reflections on guardrails) are labeled. In this context, we discuss the characteristics of the dataset and the challenges in objectively describing visual cues such as light reflections. We provide different metrics for different ways to approach the task and report the results we achieved using state-of-the-art and custom object detection models as a first benchmark. With that, we want to bring attention to a new and so far neglected field in computer vision research, encourage more researchers to tackle the problem, and thereby further close the gap between human performance and computer vision systems.

9.Pose2Drone: A Skeleton-Pose-based Framework forHuman-Drone Interaction ⬇️

Drones have become a common tool, which is utilized in many tasks such as aerial photography, surveillance, and delivery. However, operating a drone requires more and more interaction with the user. A natural and safe method for Human-Drone Interaction (HDI) is using gestures. In this paper, we introduce an HDI framework building upon skeleton-based pose estimation. Our framework provides the functionality to control the movement of the drone with simple arm gestures and to follow the user while keeping a safe distance. We also propose a monocular distance estimation method, which is entirely based on image features and does not require any additional depth sensors. To perform comprehensive experiments and quantitative analysis, we create a customized testing dataset. The experiments indicate that our HDI framework can achieve an average of93.5% accuracy in the recognition of 11 common gestures. The code will be made publicly available to foster future research. Code is available at: this https URL

10.An Efficient Style Virtual Try on Network ⬇️

With the increasing development of garment manufacturing industry, the method of combining neural network with industry to reduce product redundancy has been paid more and more this http URL order to reduce garment redundancy and achieve personalized customization, more researchers have appeared in the field of virtual trying on.They try to transfer the target clothing to the reference figure, and then stylize the clothes to meet user's requirements for fashion.But the biggest problem of virtual try on is that the shape and motion blocking distort the clothes, causing the patterns and texture on the clothes to be impossible to restore. This paper proposed a new stylized virtual try on network, which can not only retain the authenticity of clothing texture and pattern, but also obtain the undifferentiated stylized try on. The network is divided into three sub-networks, the first is the user image, the front of the target clothing image, the semantic segmentation image and the posture heat map to generate a more detailed human parsing map. Second, UV position map and dense correspondence are used to map patterns and textures to the deformed silhouettes in real time, so that they can be retained in real time, and the rationality of spatial structure can be guaranteed on the basis of improving the authenticity of images. Third,Stylize and adjust the generated virtual try on image. Through the most subtle changes, users can choose the texture, color and style of clothing to improve the user's experience.

11.When Liebig's Barrel Meets Facial Landmark Detection: A Practical Model ⬇️

In recent years, significant progress has been made in the research of facial landmark detection. However, few prior works have thoroughly discussed about models for practical applications. Instead, they often focus on improving a couple of issues at a time while ignoring the others. To bridge this gap, we aim to explore a practical model that is accurate, robust, efficient, generalizable, and end-to-end trainable at the same time. To this end, we first propose a baseline model equipped with one transformer decoder as detection head. In order to achieve a better accuracy, we further propose two lightweight modules, namely dynamic query initialization (DQInit) and query-aware memory (QAMem). Specifically, DQInit dynamically initializes the queries of decoder from the inputs, enabling the model to achieve as good accuracy as the ones with multiple decoder layers. QAMem is designed to enhance the discriminative ability of queries on low-resolution feature maps by assigning separate memory values to each query rather than a shared one. With the help of QAMem, our model removes the dependence on high-resolution feature maps and is still able to obtain superior accuracy. Extensive experiments and analysis on three popular benchmarks show the effectiveness and practical advantages of the proposed model. Notably, our model achieves new state of the art on WFLW as well as competitive results on 300W and COFW, while still running at 50+ FPS.

12.Blind Motion Deblurring Super-Resolution: When Dynamic Spatio-Temporal Learning Meets Static Image Understanding ⬇️

Single-image super-resolution (SR) and multi-frame SR are two ways to super resolve low-resolution images. Single-Image SR generally handles each image independently, but ignores the temporal information implied in continuing frames. Multi-frame SR is able to model the temporal dependency via capturing motion information. However, it relies on neighbouring frames which are not always available in the real world. Meanwhile, slight camera shake easily causes heavy motion blur on long-distance-shot low-resolution images. To address these problems, a Blind Motion Deblurring Super-Reslution Networks, BMDSRNet, is proposed to learn dynamic spatio-temporal information from single static motion-blurred images. Motion-blurred images are the accumulation over time during the exposure of cameras, while the proposed BMDSRNet learns the reverse process and uses three-streams to learn Bidirectional spatio-temporal information based on well designed reconstruction loss functions to recover clean high-resolution images. Extensive experiments demonstrate that the proposed BMDSRNet outperforms recent state-of-the-art methods, and has the ability to simultaneously deal with image deblurring and SR.

13.The Imaginative Generative Adversarial Network: Automatic Data Augmentation for Dynamic Skeleton-Based Hand Gesture and Human Action Recognition ⬇️

Deep learning approaches deliver state-of-the-art performance in recognition of spatiotemporal human motion data. However, one of the main challenges in these recognition tasks is limited available training data. Insufficient training data results in over-fitting and data augmentation is one approach to address this challenge. Existing data augmentation strategies, such as transformations including scaling, shifting and interpolating, require hyperparameter optimization that can easily cost hundreds of GPU hours. In this paper, we present a novel automatic data augmentation model, the Imaginative Generative Adversarial Network (GAN) that approximates the distribution of the input data and samples new data from this distribution. It is automatic in that it requires no data inspection and little hyperparameter tuning and therefore it is a low-cost and low-effort approach to generate synthetic data. The proposed data augmentation strategy is fast to train and the synthetic data leads to higher recognition accuracy than using data augmented with a classical approach.

14.SSAN: Separable Self-Attention Network for Video Representation Learning ⬇️

Self-attention has been successfully applied to video representation learning due to the effectiveness of modeling long range dependencies. Existing approaches build the dependencies merely by computing the pairwise correlations along spatial and temporal dimensions simultaneously. However, spatial correlations and temporal correlations represent different contextual information of scenes and temporal reasoning. Intuitively, learning spatial contextual information first will benefit temporal modeling. In this paper, we propose a separable self-attention (SSA) module, which models spatial and temporal correlations sequentially, so that spatial contexts can be efficiently used in temporal modeling. By adding SSA module into 2D CNN, we build a SSA network (SSAN) for video representation learning. On the task of video action recognition, our approach outperforms state-of-the-art methods on Something-Something and Kinetics-400 datasets. Our models often outperform counterparts with shallower network and fewer modalities. We further verify the semantic learning ability of our method in visual-language task of video retrieval, which showcases the homogeneity of video representations and text embeddings. On MSR-VTT and Youcook2 datasets, video representations learnt by SSA significantly improve the state-of-the-art performance.

15.Stylizing 3D Scene via Implicit Representation and HyperNetwork ⬇️

In this work, we aim to address the 3D scene stylization problem - generating stylized images of the scene at arbitrary novel view angles. A straightforward solution is to combine existing novel view synthesis and image/video style transfer approaches, which often leads to blurry results or inconsistent appearance. Inspired by the high quality results of the neural radiance fields (NeRF) method, we propose a joint framework to directly render novel views with the desired style. Our framework consists of two components: an implicit representation of the 3D scene with the neural radiance field model, and a hypernetwork to transfer the style information into the scene representation. In particular, our implicit representation model disentangles the scene into the geometry and appearance branches, and the hypernetwork learns to predict the parameters of the appearance branch from the reference style image. To alleviate the training difficulties and memory burden, we propose a two-stage training procedure and a patch sub-sampling approach to optimize the style and content losses with the neural radiance field model. After optimization, our model is able to render consistent novel views at arbitrary view angles with arbitrary style. Both quantitative evaluation and human subject study have demonstrated that the proposed method generates faithful stylization results with consistent appearance across different views.

16.PSRR-MaxpoolNMS: Pyramid Shifted MaxpoolNMS with Relationship Recovery ⬇️

Non-maximum Suppression (NMS) is an essential postprocessing step in modern convolutional neural networks for object detection. Unlike convolutions which are inherently parallel, the de-facto standard for NMS, namely GreedyNMS, cannot be easily parallelized and thus could be the performance bottleneck in convolutional object detection pipelines. MaxpoolNMS is introduced as a parallelizable alternative to GreedyNMS, which in turn enables faster speed than GreedyNMS at comparable accuracy. However, MaxpoolNMS is only capable of replacing the GreedyNMS at the first stage of two-stage detectors like Faster-RCNN. There is a significant drop in accuracy when applying MaxpoolNMS at the final detection stage, due to the fact that MaxpoolNMS fails to approximate GreedyNMS precisely in terms of bounding box selection. In this paper, we propose a general, parallelizable and configurable approach PSRR-MaxpoolNMS, to completely replace GreedyNMS at all stages in all detectors. By introducing a simple Relationship Recovery module and a Pyramid Shifted MaxpoolNMS module, our PSRR-MaxpoolNMS is able to approximate GreedyNMS more precisely than MaxpoolNMS. Comprehensive experiments show that our approach outperforms MaxpoolNMS by a large margin, and it is proven faster than GreedyNMS with comparable accuracy. For the first time, PSRR-MaxpoolNMS provides a fully parallelizable solution for customized hardware design, which can be reused for accelerating NMS everywhere.

17.Joint-DetNAS: Upgrade Your Detector with NAS, Pruning and Dynamic Distillation ⬇️

We propose Joint-DetNAS, a unified NAS framework for object detection, which integrates 3 key components: Neural Architecture Search, pruning, and Knowledge Distillation. Instead of naively pipelining these techniques, our Joint-DetNAS optimizes them jointly. The algorithm consists of two core processes: student morphism optimizes the student's architecture and removes the redundant parameters, while dynamic distillation aims to find the optimal matching teacher. For student morphism, weight inheritance strategy is adopted, allowing the student to flexibly update its architecture while fully utilize the predecessor's weights, which considerably accelerates the search; To facilitate dynamic distillation, an elastic teacher pool is trained via integrated progressive shrinking strategy, from which teacher detectors can be sampled without additional cost in subsequent searches. Given a base detector as the input, our algorithm directly outputs the derived student detector with high performance without additional training. Experiments demonstrate that our Joint-DetNAS outperforms the naive pipelining approach by a great margin. Given a classic R101-FPN as the base detector, Joint-DetNAS is able to boost its mAP from 41.4 to 43.9 on MS COCO and reduce the latency by 47%, which is on par with the SOTA EfficientDet while requiring less search cost. We hope our proposed method can provide the community with a new way of jointly optimizing NAS, KD and pruning.

18.Feature Reuse and Fusion for Real-time Semantic segmentation ⬇️

For real-time semantic segmentation, how to increase the speed while maintaining high resolution is a problem that has been discussed and solved. Backbone design and fusion design have always been two essential parts of real-time semantic segmentation. We hope to design a light-weight network based on previous design experience and reach the level of state-of-the-art real-time semantic segmentation without any pre-training. To achieve this goal, a encoder-decoder architectures are proposed to solve this problem by applying a decoder network onto a backbone model designed for real-time segmentation tasks and designed three different ways to fuse semantics and detailed information in the aggregation phase. We have conducted extensive experiments on two semantic segmentation benchmarks. Experiments on the Cityscapes and CamVid datasets show that the proposed FRFNet strikes a balance between speed calculation and accuracy. It achieves 76.4% Mean Intersection over Union (mIoU%) on the Cityscapes test dataset with the speed of 161 FPS on a single RTX 2080Ti card. The Code is available at this https URL.

19.Unsupervised Adaptive Semantic Segmentation with Local Lipschitz Constraint ⬇️

Recent advances in unsupervised domain adaptation have seen considerable progress in semantic segmentation. Existing methods either align different domains with adversarial training or involve the self-learning that utilizes pseudo labels to conduct supervised training. The former always suffers from the unstable training caused by adversarial training and only focuses on the inter-domain gap that ignores intra-domain knowledge. The latter tends to put overconfident label prediction on wrong categories, which propagates errors to more samples. To solve these problems, we propose a two-stage adaptive semantic segmentation method based on the local Lipschitz constraint that satisfies both domain alignment and domain-specific exploration under a unified principle. In the first stage, we propose the local Lipschitzness regularization as the objective function to align different domains by exploiting intra-domain knowledge, which explores a promising direction for non-adversarial adaptive semantic segmentation. In the second stage, we use the local Lipschitzness regularization to estimate the probability of satisfying Lipschitzness for each pixel, and then dynamically sets the threshold of pseudo labels to conduct self-learning. Such dynamical self-learning effectively avoids the error propagation caused by noisy labels. Optimization in both stages is based on the same principle, i.e., the local Lipschitz constraint, so that the knowledge learned in the first stage can be maintained in the second stage. Further, due to the model-agnostic property, our method can easily adapt to any CNN-based semantic segmentation networks. Experimental results demonstrate the excellent performance of our method on standard benchmarks.

20.YOLO5Face: Why Reinventing a Face Detector ⬇️

Tremendous progress has been made on face detection in recent years using convolutional neural networks. While many face detectors use designs designated for the detection of face, we treat face detection as a general object detection task. We implement a face detector based on YOLOv5 object detector and call it YOLO5Face. We add a five-point landmark regression head into it and use the Wing loss function. We design detectors with different model sizes, from a large model to achieve the best performance, to a super small model for real-time detection on an embedded or mobile device. Experiment results on the WiderFace dataset show that our face detectors can achieve state-of-the-art performance in almost all the Easy, Medium, and Hard subsets, exceeding the more complex designated face detectors. The code is available at \url{this https URL}.

21.Image-Based Plant Wilting Estimation ⬇️

Many plants become limp or droop through heat, loss of water, or disease. This is also known as wilting. In this paper, we examine plant wilting caused by bacterial infection. In particular, we want to design a metric for wilting based on images acquired of the plant. A quantifiable wilting metric will be useful in studying bacterial wilt and identifying resistance genes. Since there is no standard way to estimate wilting, it is common to use ad hoc visual scores. This is very subjective and requires expert knowledge of the plants and the disease mechanism. Our solution consists of using various wilting metrics acquired from RGB images of the plants. We also designed several experiments to demonstrate that our metrics are effective at estimating wilting in plants.

22.Self-Ensembling Contrastive Learning for Semi-Supervised Medical Image Segmentation ⬇️

Deep learning has demonstrated significant improvements in medical image segmentation using a sufficiently large amount of training data with manual labels. Acquiring well-representative labels requires expert knowledge and exhaustive labors. In this paper, we aim to boost the performance of semi-supervised learning for medical image segmentation with limited labels using a self-ensembling contrastive learning technique. To this end, we propose to train an encoder-decoder network at image-level with small amounts of labeled images, and more importantly, we learn latent representations directly at feature-level by imposing contrastive loss on unlabeled images. This method strengthens intra-class compactness and inter-class separability, so as to get a better pixel classifier. Moreover, we devise a student encoder for online learning and an exponential moving average version of it, called teacher encoder, to improve the performance iteratively in a self-ensembling manner. To construct contrastive samples with unlabeled images, two sampling strategies that exploit structure similarity across medical images and utilize pseudo-labels for construction, termed region-aware and anatomical-aware contrastive sampling, are investigated. We conduct extensive experiments on an MRI and a CT segmentation dataset and demonstrate that in a limited label setting, the proposed method achieves state-of-the-art performance. Moreover, the anatomical-aware strategy that prepares contrastive samples on-the-fly using pseudo-labels realizes better contrastive regularization on feature representations.

23.3D Segmentation Learning from Sparse Annotations and Hierarchical Descriptors ⬇️

One of the main obstacles to 3D semantic segmentation is the significant amount of endeavor required to generate expensive point-wise annotations for fully supervised training. To alleviate manual efforts, we propose GIDSeg, a novel approach that can simultaneously learn segmentation from sparse annotations via reasoning global-regional structures and individual-vicinal properties. GIDSeg depicts global- and individual- relation via a dynamic edge convolution network coupled with a kernelized identity descriptor. The ensemble effects are obtained by endowing a fine-grained receptive field to a low-resolution voxelized map. In our GIDSeg, an adversarial learning module is also designed to further enhance the conditional constraint of identity descriptors within the joint feature distribution. Despite the apparent simplicity, our proposed approach achieves superior performance over state-of-the-art for inferencing 3D dense segmentation with only sparse annotations. Particularly, with $5%$ annotations of raw data, GIDSeg outperforms other 3D segmentation methods.

24.i3dLoc: Image-to-range Cross-domain Localization Robust to Inconsistent Environmental Conditions ⬇️

We present a method for localizing a single camera with respect to a point cloud map in indoor and outdoor scenes. The problem is challenging because correspondences of local invariant features are inconsistent across the domains between image and 3D. The problem is even more challenging as the method must handle various environmental conditions such as illumination, weather, and seasonal changes. Our method can match equirectangular images to the 3D range projections by extracting cross-domain symmetric place descriptors. Our key insight is to retain condition-invariant 3D geometry features from limited data samples while eliminating the condition-related features by a designed Generative Adversarial Network. Based on such features, we further design a spherical convolution network to learn viewpoint-invariant symmetric place descriptors. We evaluate our method on extensive self-collected datasets, which involve \textit{Long-term} (variant appearance conditions), \textit{Large-scale} (up to $2km$ structure/unstructured environment), and \textit{Multistory} (four-floor confined space). Our method surpasses other current state-of-the-arts by achieving around $3$ times higher place retrievals to inconsistent environments, and above $3$ times accuracy on online localization. To highlight our method's generalization capabilities, we also evaluate the recognition across different datasets. With a single trained model, i3dLoc can demonstrate reliable visual localization in random conditions.

25.Benchmarking Scientific Image Forgery Detectors ⬇️

The scientific image integrity area presents a challenging research bottleneck, the lack of available datasets to design and evaluate forensic techniques. Its data sensitivity creates a legal hurdle that prevents one to rely on real tampered cases to build any sort of accessible forensic benchmark. To mitigate this bottleneck, we present an extendable open-source library that reproduces the most common image forgery operations reported by the research integrity community: duplication, retouching, and cleaning. Using this library and realistic scientific images, we create a large scientific forgery image benchmark (39,423 images) with an enriched ground-truth. In addition, concerned about the high number of retracted papers due to image duplication, this work evaluates the state-of-the-art copy-move detection methods in the proposed dataset, using a new metric that asserts consistent match detection between the source and the copied region. The dataset and source-code will be freely available upon acceptance of the paper.

26.Multi-Modal Semantic Inconsistency Detection in Social Media News Posts ⬇️

As computer-generated content and deepfakes make steady improvements, semantic approaches to multimedia forensics will become more important. In this paper, we introduce a novel classification architecture for identifying semantic inconsistencies between video appearance and text caption in social media news posts. We develop a multi-modal fusion framework to identify mismatches between videos and captions in social media posts by leveraging an ensemble method based on textual analysis of the caption, automatic audio transcription, semantic video analysis, object detection, named entity consistency, and facial verification. To train and test our approach, we curate a new video-based dataset of 4,000 real-world Facebook news posts for analysis. Our multi-modal approach achieves 60.5% classification accuracy on random mismatches between caption and appearance, compared to accuracy below 50% for uni-modal models. Further ablation studies confirm the necessity of fusion across modalities for correctly identifying semantic inconsistencies.

27.Issues in Object Detection in Videos using Common Single-Image CNNs ⬇️

A growing branch of computer vision is object detection. Object detection is used in many applications such as industrial process, medical imaging analysis, and autonomous vehicles. The ability to detect objects in videos is crucial. Object detection systems are trained on large image datasets. For applications such as autonomous vehicles, it is crucial that the object detection system can identify objects through multiple frames in video. There are many problems with applying these systems to video. Shadows or changes in brightness that can cause the system to incorrectly identify objects frame to frame and cause an unintended system response. There are many neural networks that have been used for object detection and if there was a way of connecting objects between frames then these problems could be eliminated. For these neural networks to get better at identifying objects in video, they need to be re-trained. A dataset must be created with images that represent consecutive video frames and have matching ground-truth layers. A method is proposed that can generate these datasets. The ground-truth layer contains only moving objects. To generate this layer, FlowNet2-Pytorch was used to create the flow mask using the novel Magnitude Method. As well, a segmentation mask will be generated using networks such as Mask R-CNN or Refinenet. These segmentation masks will contain all objects detected in a frame. By comparing this segmentation mask to the flow mask ground-truth layer, a loss function is generated. This loss function can be used to train a neural network to be better at making consistent predictions on video. The system was tested on multiple video samples and a loss was generated for each frame, proving the Magnitude Method's ability to be used to train object detection neural networks in future work.

28.ViPTT-Net: Video pretraining of spatio-temporal model for tuberculosis type classification from chest CT scans ⬇️

Pretraining has sparked groundswell of interest in deep learning workflows to learn from limited data and improve generalization. While this is common for 2D image classification tasks, its application to 3D medical imaging tasks like chest CT interpretation is limited. We explore the idea of whether pretraining a model on realistic videos could improve performance rather than training the model from scratch, intended for tuberculosis type classification from chest CT scans. To incorporate both spatial and temporal features, we develop a hybrid convolutional neural network (CNN) and recurrent neural network (RNN) model, where the features are extracted from each axial slice of the CT scan by a CNN, these sequence of image features are input to a RNN for classification of the CT scan. Our model termed as ViPTT-Net, was trained on over 1300 video clips with labels of human activities, and then fine-tuned on chest CT scans with labels of tuberculosis type. We find that pretraining the model on videos lead to better representations and significantly improved model validation performance from a kappa score of 0.17 to 0.35, especially for under-represented class samples. Our best method achieved 2nd place in the ImageCLEF 2021 Tuberculosis - TBT classification task with a kappa score of 0.20 on the final test set with only image information (without using clinical meta-data). All codes and models are made available.

29.DFPN: Deformable Frame Prediction Network ⬇️

Learned frame prediction is a current problem of interest in computer vision and video compression. Although several deep network architectures have been proposed for learned frame prediction, to the best of our knowledge, there is no work based on using deformable convolutions for frame prediction. To this effect, we propose a deformable frame prediction network (DFPN) for task oriented implicit motion modeling and next frame prediction. Experimental results demonstrate that the proposed DFPN model achieves state of the art results in next frame prediction. Our models and results are available at this https URL.

30.RSCA: Real-time Segmentation-based Context-Aware Scene Text Detection ⬇️

Segmentation-based scene text detection methods have been widely adopted for arbitrary-shaped text detection recently, since they make accurate pixel-level predictions on curved text instances and can facilitate real-time inference without time-consuming processing on anchors. However, current segmentation-based models are unable to learn the shapes of curved texts and often require complex label assignments or repeated feature aggregations for more accurate detection. In this paper, we propose RSCA: a Real-time Segmentation-based Context-Aware model for arbitrary-shaped scene text detection, which sets a strong baseline for scene text detection with two simple yet effective strategies: Local Context-Aware Upsampling and Dynamic Text-Spine Labeling, which model local spatial transformation and simplify label assignments separately. Based on these strategies, RSCA achieves state-of-the-art performance in both speed and accuracy, without complex label assignments or repeated feature aggregations. We conduct extensive experiments on multiple benchmarks to validate the effectiveness of our method. RSCA-640 reaches 83.9% F-measure at 48.3 FPS on CTW1500 dataset.

31.cofga: A Dataset for Fine Grained Classification of Objects from Aerial Imagery ⬇️

Detection and classification of objects in overhead images are two important and challenging problems in computer vision. Among various research areas in this domain, the task of fine-grained classification of objects in overhead images has become ubiquitous in diverse real-world applications, due to recent advances in high-resolution satellite and airborne imaging systems. The small inter-class variations and the large intra class variations caused by the fine grained nature make it a challenging task, especially in low-resource cases. In this paper, we introduce COFGA a new open dataset for the advancement of fine-grained classification research. The 2,104 images in the dataset are collected from an airborne imaging system at 5 15 cm ground sampling distance, providing higher spatial resolution than most public overhead imagery datasets. The 14,256 annotated objects in the dataset were classified into 2 classes, 15 subclasses, 14 unique features, and 8 perceived colors a total of 37 distinct labels making it suitable to the task of fine-grained classification more than any other publicly available overhead imagery dataset. We compare COFGA to other overhead imagery datasets and then describe some distinguished fine-grain classification approaches that were explored during an open data-science competition we have conducted for this task.

32.DSLR: Dynamic to Static LiDAR Scan Reconstruction Using Adversarially Trained Autoencoder ⬇️

Accurate reconstruction of static environments from LiDAR scans of scenes containing dynamic objects, which we refer to as Dynamic to Static Translation (DST), is an important area of research in Autonomous Navigation. This problem has been recently explored for visual SLAM, but to the best of our knowledge no work has been attempted to address DST for LiDAR scans. The problem is of critical importance due to wide-spread adoption of LiDAR in Autonomous Vehicles. We show that state-of the art methods developed for the visual domain when adapted for LiDAR scans perform poorly.
We develop DSLR, a deep generative model which learns a mapping between dynamic scan to its static counterpart through an adversarially trained autoencoder. Our model yields the first solution for DST on LiDAR that generates static scans without using explicit segmentation labels. DSLR cannot always be applied to real world data due to lack of paired dynamic-static scans. Using Unsupervised Domain Adaptation, we propose DSLR-UDA for transfer to real world data and experimentally show that this performs well in real world settings. Additionally, if segmentation information is available, we extend DSLR to DSLR-Seg to further improve the reconstruction quality.
DSLR gives the state of the art performance on simulated and real-world datasets and also shows at least 4x improvement. We show that DSLR, unlike the existing baselines, is a practically viable model with its reconstruction quality within the tolerable limits for tasks pertaining to autonomous navigation like SLAM in dynamic environments.

33.An Online Learning System for Wireless Charging Alignment using Surround-view Fisheye Cameras ⬇️

Electric Vehicles are increasingly common, with inductive chargepads being considered a convenient and efficient means of charging electric vehicles. However, drivers are typically poor at aligning the vehicle to the necessary accuracy for efficient inductive charging, making the automated alignment of the two charging plates desirable. In parallel to the electrification of the vehicular fleet, automated parking systems that make use of surround-view camera systems are becoming increasingly popular. In this work, we propose a system based on the surround-view camera architecture to detect, localize and automatically align the vehicle with the inductive chargepad. The visual design of the chargepads is not standardized and not necessarily known beforehand. Therefore a system that relies on offline training will fail in some situations. Thus we propose an online learning method that leverages the driver's actions when manually aligning the vehicle with the chargepad and combine it with weak supervision from semantic segmentation and depth to learn a classifier to auto-annotate the chargepad in the video for further training. In this way, when faced with a previously unseen chargepad, the driver needs only manually align the vehicle a single time. As the chargepad is flat on the ground, it is not easy to detect it from a distance. Thus, we propose using a Visual SLAM pipeline to learn landmarks relative to the chargepad to enable alignment from a greater range. We demonstrate the working system on an automated vehicle as illustrated in the video this https URL. To encourage further research, we will share a chargepad dataset used in this work.

34.Drawing Multiple Augmentation Samples Per Image During Training Efficiently Decreases Test Error ⬇️

In computer vision, it is standard practice to draw a single sample from the data augmentation procedure for each unique image in the mini-batch, however it is not clear whether this choice is optimal for generalization. In this work, we provide a detailed empirical evaluation of how the number of augmentation samples per unique image influences performance on held out data. Remarkably, we find that drawing multiple samples per image consistently enhances the test accuracy achieved for both small and large batch training, despite reducing the number of unique training examples in each mini-batch. This benefit arises even when different augmentation multiplicities perform the same number of parameter updates and gradient evaluations. Our results suggest that, although the variance in the gradient estimate arising from subsampling the dataset has an implicit regularization benefit, the variance which arises from the data augmentation process harms test accuracy. By applying augmentation multiplicity to the recently proposed NFNet model family, we achieve a new ImageNet state of the art of 86.8$%$ top-1 w/o extra data.

35.Cardiac Segmentation on CT Images through Shape-Aware Contour Attentions ⬇️

Cardiac segmentation of atriums, ventricles, and myocardium in computed tomography (CT) images is an important first-line task for presymptomatic cardiovascular disease diagnosis. In several recent studies, deep learning models have shown significant breakthroughs in medical image segmentation tasks. Unlike other organs such as the lungs and liver, the cardiac organ consists of multiple substructures, i.e., ventricles, atriums, aortas, arteries, veins, and myocardium. These cardiac substructures are proximate to each other and have indiscernible boundaries (i.e., homogeneous intensity values), making it difficult for the segmentation network focus on the boundaries between the substructures. In this paper, to improve the segmentation accuracy between proximate organs, we introduce a novel model to exploit shape and boundary-aware features. We primarily propose a shape-aware attention module, that exploits distance regression, which can guide the model to focus on the edges between substructures so that it can outperform the conventional contour-based attention method. In the experiments, we used the Multi-Modality Whole Heart Segmentation dataset that has 20 CT cardiac images for training and validation, and 40 CT cardiac images for testing. The experimental results show that the proposed network produces more accurate results than state-of-the-art networks by improving the Dice similarity coefficient score by 4.97%. Our proposed shape-aware contour attention mechanism demonstrates that distance transformation and boundary features improve the actual attention map to strengthen the responses in the boundary area. Moreover, our proposed method significantly reduces the false-positive responses of the final output, resulting in accurate segmentation.

36.Graph-Based Deep Learning for Medical Diagnosis and Analysis: Past, Present and Future ⬇️

With the advances of data-driven machine learning research, a wide variety of prediction problems have been tackled. It has become critical to explore how machine learning and specifically deep learning methods can be exploited to analyse healthcare data. A major limitation of existing methods has been the focus on grid-like data; however, the structure of physiological recordings are often irregular and unordered which makes it difficult to conceptualise them as a matrix. As such, graph neural networks have attracted significant attention by exploiting implicit information that resides in a biological system, with interactive nodes connected by edges whose weights can be either temporal associations or anatomical junctions. In this survey, we thoroughly review the different types of graph architectures and their applications in healthcare. We provide an overview of these methods in a systematic manner, organized by their domain of application including functional connectivity, anatomical structure and electrical-based analysis. We also outline the limitations of existing techniques and discuss potential directions for future research.

37.Continual Learning at the Edge: Real-Time Training on Smartphone Devices ⬇️

On-device training for personalized learning is a challenging research problem. Being able to quickly adapt deep prediction models at the edge is necessary to better suit personal user needs. However, adaptation on the edge poses some questions on both the efficiency and sustainability of the learning process and on the ability to work under shifting data distributions. Indeed, naively fine-tuning a prediction model only on the newly available data results in catastrophic forgetting, a sudden erasure of previously acquired knowledge. In this paper, we detail the implementation and deployment of a hybrid continual learning strategy (AR1*) on a native Android application for real-time on-device personalization without forgetting. Our benchmark, based on an extension of the CORe50 dataset, shows the efficiency and effectiveness of our solution.

38.HDRUNet: Single Image HDR Reconstruction with Denoising and Dequantization ⬇️

Most consumer-grade digital cameras can only capture a limited range of luminance in real-world scenes due to sensor constraints. Besides, noise and quantization errors are often introduced in the imaging process. In order to obtain high dynamic range (HDR) images with excellent visual quality, the most common solution is to combine multiple images with different exposures. However, it is not always feasible to obtain multiple images of the same scene and most HDR reconstruction methods ignore the noise and quantization loss. In this work, we propose a novel learning-based approach using a spatially dynamic encoder-decoder network, HDRUNet, to learn an end-to-end mapping for single image HDR reconstruction with denoising and dequantization. The network consists of a UNet-style base network to make full use of the hierarchical multi-scale information, a condition network to perform pattern-specific modulation and a weighting network for selectively retaining information. Moreover, we propose a Tanh_L1 loss function to balance the impact of over-exposed values and well-exposed values on the network learning. Our method achieves the state-of-the-art performance in quantitative comparisons and visual quality. The proposed HDRUNet model won the second place in the single frame track of NITRE2021 High Dynamic Range Challenge.

39.Efficient High-Resolution Image-to-Image Translation using Multi-Scale Gradient U-Net ⬇️

Recently, Conditional Generative Adversarial Network (Conditional GAN) have shown very promising performance in several image-to-image translation applications. However, the uses of these conditional GANs are quite limited to low-resolution images, such as 256X256.The Pix2Pix-HD is a recent attempt to utilize the conditional GAN for high-resolution image synthesis. In this paper, we propose a Multi-Scale Gradient based U-Net (MSG U-Net) model for high-resolution image-to-image translation up to 2048X1024 resolution. The proposed model is trained by allowing the flow of gradients from multiple-discriminators to a single generator at multiple scales. The proposed MSG U-Net architecture leads to photo-realistic high-resolution image-to-image translation. Moreover, the proposed model is computationally efficient as com-pared to the Pix2Pix-HD with an improvement in the inference time nearly by 2.5 times. We provide the code of MSG U-Net model at this https URL.

40.Passing Multi-Channel Material Textures to a 3-Channel Loss ⬇️

Our objective is to compute a textural loss that can be used to train texture generators with multiple material channels typically used for physically based rendering such as albedo, normal, roughness, metalness, ambient occlusion, etc. Neural textural losses often build on top of the feature spaces of pretrained convolutional neural networks. Unfortunately, these pretrained models are only available for 3-channel RGB data and hence limit neural textural losses to this format. To overcome this limitation, we show that passing random triplets to a 3-channel loss provides a multi-channel loss that can be used to generate high-quality material textures.

41.Robust Navigation for Racing Drones based on Imitation Learning and Modularization ⬇️

This paper presents a vision-based modularized drone racing navigation system that uses a customized convolutional neural network (CNN) for the perception module to produce high-level navigation commands and then leverages a state-of-the-art planner and controller to generate low-level control commands, thus exploiting the advantages of both data-based and model-based approaches. Unlike the state-of-the-art method which only takes the current camera image as the CNN input, we further add the latest three drone states as part of the inputs. Our method outperforms the state-of-the-art method in various track layouts and offers two switchable navigation behaviors with a single trained network. The CNN-based perception module is trained to imitate an expert policy that automatically generates ground truth navigation commands based on the pre-computed global trajectories. Owing to the extensive randomization and our modified dataset aggregation (DAgger) policy during data collection, our navigation system, which is purely trained in simulation with synthetic textures, successfully operates in environments with randomly-chosen photorealistic textures without further fine-tuning.

42.Computer Vision and Conflicting Values: Describing People with Automated Alt Text ⬇️

Scholars have recently drawn attention to a range of controversial issues posed by the use of computer vision for automatically generating descriptions of people in images. Despite these concerns, automated image description has become an important tool to ensure equitable access to information for blind and low vision people. In this paper, we investigate the ethical dilemmas faced by companies that have adopted the use of computer vision for producing alt text: textual descriptions of images for blind and low vision people, We use Facebook's automatic alt text tool as our primary case study. First, we analyze the policies that Facebook has adopted with respect to identity categories, such as race, gender, age, etc., and the company's decisions about whether to present these terms in alt text. We then describe an alternative -- and manual -- approach practiced in the museum community, focusing on how museums determine what to include in alt text descriptions of cultural artifacts. We compare these policies, using notable points of contrast to develop an analytic framework that characterizes the particular apprehensions behind these policy choices. We conclude by considering two strategies that seem to sidestep some of these concerns, finding that there are no easy ways to avoid the normative dilemmas posed by the use of computer vision to automate alt text.