[Bug]: Some data used for both train and test when using folder dataset format with random seed #746

Closed
yesjuhyeong opened this issue Nov 30, 2022 · 2 comments

@yesjuhyeong

Describe the bug

I'm using Anomalib for anomaly detection.
My custom dataset is very small and is not pre-split into train/validation (test) sets,
so I split it with the split_normal_images_in_train_set function in anomalib/data/utils/split.py.

However, after self.train_data and self.test_data are created in anomalib/data/folder.py,
some images appear in both self.train_data and self.test_data.

I think data used for training is reused for validation under the seed=0 (random seed) condition,
because self.train_data and self.test_data are created independently.

self.train_data
https://github.com/openvinotoolkit/anomalib/blob/main/anomalib/data/folder.py#L483

self.test_data
https://github.com/openvinotoolkit/anomalib/blob/main/anomalib/data/folder.py#L512

Please check this issue.
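
For illustration only, a minimal Python sketch (not the actual anomalib code): when two subsets are sampled independently from the same pool of images, nothing guarantees that they are disjoint, so the same file can end up in both splits.

import random

# Two splits sampled independently from the same pool are not
# guaranteed to be disjoint.
pool = [f"image_{i:03d}.png" for i in range(10)]

train = random.sample(pool, 6)  # hypothetical "train" split
test = random.sample(pool, 4)   # hypothetical "test" split, drawn independently

print(sorted(set(train) & set(test)))  # usually non-empty: these images leak into both splits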

Dataset

Folder

Model

PatchCore

Steps to reproduce the behavior

  1. Install Anomalib
  2. Prepare a dataset in the folder format (a small dataset is recommended)
  3. Create a config file for the folder format
  4. Debug anomalib/data/folder.py
  5. Compare self.train_data and self.test_data (see the sketch below)
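
A quick way to perform step 5 is to intersect the image paths of the two splits. This is only a sketch and assumes self.train_data and self.test_data are pandas DataFrames with an image_path column, as in Anomalib 0.3.x:

import pandas as pd

def find_overlap(train_data: pd.DataFrame, test_data: pd.DataFrame) -> set:
    # Image paths that appear in both the train and the test split.
    return set(train_data["image_path"]) & set(test_data["image_path"])

# At a breakpoint inside anomalib/data/folder.py:
# overlap = find_overlap(self.train_data, self.test_data)
# print(f"{len(overlap)} images are shared between train and test")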

OS information


  • Ubuntu 20.04
  • Python version: [e.g. 3.8.10]
  • Anomalib version: 0.3.3
  • PyTorch version: 1.11.0
  • CUDA/cuDNN version: 11.4
  • GPU models and configuration: GeForce RTX 2080
  • Any other relevant information: I'm using a custom dataset

Expected behavior

I would like to know whether this issue is a real bug or a mistake on my side.
If it is a bug, please share the plan for fixing it.

Screenshots

No response

Pip/GitHub

GitHub

What version/branch did you use?

Anomalib version: 0.3.3

Configuration YAML

dataset:
  name: private_data
  format: folder
  path: /private_data
  task: segmentation
  category: bottle
  image_size: 224
  train_batch_size: 32
  test_batch_size: 32
  num_workers: 8
  transform_config:
    train: null
    val: null
  create_validation_set: false
  tiling:
    apply: false
    tile_size: null
    stride: null
    remove_border_count: 0
    use_random_tiling: False
    random_tile_count: 16

model:
  name: patchcore
  backbone: resnet18
  pre_trained: true
  layers:
    - layer2
    - layer3
  coreset_sampling_ratio: 0.1
  num_neighbors: 9
  normalization_method: min_max # options: [null, min_max, cdf]

metrics:
  image:
    - F1Score
    - AUROC
  pixel:
    - F1Score
    - AUROC
  threshold:
    method: adaptive #options: [adaptive, manual]
    manual_image: null
    manual_pixel: null

visualization:
  show_images: False # show images on the screen
  save_images: True # save images to the file system
  log_images: True # log images to the available loggers (if any)
  image_save_path: null # path to which images will be saved
  mode: full # options: ["full", "simple"]

project:
  seed: 0
  path: ./results

logging:
  logger: [] # options: [comet, tensorboard, wandb, csv] or combinations.
  log_graph: false # Logs the model graph to respective logger.

optimization:
  export_mode: null # options: onnx, openvino

# PL Trainer Args. Don't add extra parameter here.
trainer:
  accelerator: auto # <"cpu", "gpu", "tpu", "ipu", "hpu", "auto">
  accumulate_grad_batches: 1
  amp_backend: native
  auto_lr_find: false
  auto_scale_batch_size: false
  auto_select_gpus: false
  benchmark: false
  check_val_every_n_epoch: 1 # Don't validate before extracting features.
  default_root_dir: null
  detect_anomaly: false
  deterministic: false
  devices: 1
  enable_checkpointing: true
  enable_model_summary: true
  enable_progress_bar: true
  fast_dev_run: false
  gpus: null # Set automatically
  gradient_clip_val: 0
  ipus: null
  limit_predict_batches: 1.0
  limit_test_batches: 1.0
  limit_train_batches: 1.0
  limit_val_batches: 1.0
  log_every_n_steps: 50
  log_gpu_memory: null
  max_epochs: 1
  max_steps: -1
  max_time: null
  min_epochs: null
  min_steps: null
  move_metrics_to_cpu: false
  multiple_trainloader_mode: max_size_cycle
  num_nodes: 1
  num_processes: null
  num_sanity_val_steps: 0
  overfit_batches: 0.0
  plugins: null
  precision: 32
  profiler: null
  reload_dataloaders_every_n_epochs: 0
  replace_sampler_ddp: true
  strategy: null
  sync_batchnorm: false
  tpu_cores: null
  track_grad_norm: -1
  val_check_interval: 1.0 # Don't validate before extracting features.

Logs

Don't need logs for this issue.

Code of Conduct

  • I agree to follow this project's Code of Conduct
@djdameln djdameln self-assigned this Dec 1, 2022
@djdameln
Contributor

djdameln commented Dec 1, 2022

Hi, this was a reported bug in version 0.3.3, which was fixed in v0.3.4. Upgrading your installation of Anomalib to v0.3.4 or higher should resolve your issue.

I'm closing this issue as a duplicate but feel free to re-open if your problems persist after upgrading.
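
One way to confirm the upgrade took effect (a small sketch, assuming the anomalib package exposes __version__ as it does in the 0.3.x releases):

import anomalib

# Should report 0.3.4 or higher after upgrading.
print(anomalib.__version__)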

@djdameln djdameln closed this as not planned (duplicate) Dec 1, 2022
@yesjuhyeong
Author

Thank you @djdameln
I'll upgrade my Anomalib version.
