
Use test.py for custom labelled dataset #2242

Closed
khalidw opened this issue Feb 18, 2021 · 39 comments
Labels
question (Further information is requested), Stale

Comments

@khalidw

khalidw commented Feb 18, 2021

❔Question

Hi! I have a custom dataset of boats (only one class), which I have labelled myself. I was wondering how I can generate mAP metrics for this data using the pretrained weights (yolov5x.pt).

I understand that test.py can be used to generate mAP metrics, but I cannot figure out how this can be done for my own labelled dataset.

Additional context

khalidw added the question (Further information is requested) label Feb 18, 2021
@glenn-jocher
Member

glenn-jocher commented Feb 18, 2021

@khalidw mAP is automatically computed using test.py after every epoch during training. See the Train Custom Data tutorial to get started:

Tutorials

@khalidw
Author

khalidw commented Feb 19, 2021

@glenn-jocher Thanks for your response, but it seems my query was unclear.

  • I have two models: a pretrained one and a custom trained one
  • For the custom trained model I already have the mAP values that were generated during training
  • Now I want to generate mAP for the pretrained model on my custom dataset
  • The purpose is to compare the performance of the custom trained and pretrained models on my custom dataset

I am hoping that the custom trained model will have a better mAP score than the pretrained model on my custom dataset.

@glenn-jocher
Member

@khalidw you can use any model you want with test.py by passing it with the --weights argument:
python test.py --data your_data.yaml --weights any_model.pt

If the model was trained on the data, or has classes that intersect with your dataset's, then you should get some nonzero mAP result.

@khalidw
Author

khalidw commented Feb 21, 2021

@glenn-jocher this works perfectly fine with my custom trained models. But I also want to get mAP for the pretrained models so that a comparison can be made.

Although mAP values for the pretrained models are published, they are for the VOC and COCO datasets. I want to generate mAP for the pretrained models on my custom dataset.

When I tried to run test.py for a pretrained model on my custom dataset (boats), I ran into an error. I wasn't expecting this, as my custom dataset and the pretrained models have an intersecting class (boat).

I added an entry to localDataset/localDataset.yaml, test: localDataset/test.txt, which points to the location of the test images.
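(For context, a single-class dataset YAML of this shape is what the commands below read. Apart from the test entry described above, the paths and names here are assumed placeholders, not the actual file:)

# localDataset/localDataset.yaml -- illustrative sketch; only the test entry is taken from the text above
train: localDataset/train.txt  # assumed path to the training image list
val: localDataset/val.txt  # assumed path to the validation image list
test: localDataset/test.txt  # test image list added as described above
nc: 1  # number of classes
names: ['boat']  # class names (assumed)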

I am sharing the results for both the custom trained model (error free) and the pretrained model (with an error).

mAP for pretrained yolov5x

!python test.py --weights yolov5x.pt --data localDataset/localDataset.yaml --img 640

Namespace(augment=False, batch_size=32, conf_thres=0.001, data='localDataset/localDataset.yaml', device='', exist_ok=False, img_size=640, iou_thres=0.6, name='exp', project='runs/test', save_conf=False, save_hybrid=False, save_json=False, save_txt=False, single_cls=False, task='val', verbose=False, weights=['yolov5x.pt'])
YOLOv5 v4.0-20-ge8a41e8 torch 1.7.0+cu101 CUDA:0 (Tesla T4, 15109.75MB)

Fusing layers... 
Model Summary: 476 layers, 87730285 parameters, 0 gradients, 218.8 GFLOPS
val: Scanning 'localDataset/labels.cache' for images and labels... 100 found, 0 missing, 0 empty, 0 corrupted: 100% 100/100 [00:00<00:00, 731990.23it/s]
               Class      Images     Targets           P           R      mAP@.5  mAP@.5:.95:   0% 0/4 [00:01<?, ?it/s]
Traceback (most recent call last):
  File "test.py", line 321, in <module>
    save_conf=opt.save_conf,
  File "test.py", line 182, in test
    confusion_matrix.process_batch(pred, torch.cat((labels[:, 0:1], tbox), 1))
  File "/gdrive/My Drive/object_detection/YOLOv5/utils/metrics.py", line 146, in process_batch
    self.matrix[gc, detection_classes[m1[j]]] += 1  # correct
IndexError: index 8 is out of bounds for axis 1 with size 2
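(Side note on this traceback: with nc: 1 the confusion matrix is only (nc + 1) x (nc + 1) = 2 x 2, while the COCO-pretrained yolov5x.pt emits COCO class indices, where boat is index 8, hence the out-of-bounds access. A minimal standalone reproduction of the mechanism, not the repo's code:)

import numpy as np

nc = 1                                   # classes declared in localDataset.yaml
matrix = np.zeros((nc + 1, nc + 1))      # confusion matrix incl. background -> shape (2, 2)
gt_class = 0                             # 'boat' in the 1-class dataset
detection_class = 8                      # 'boat' in the 80-class COCO ordering used by yolov5x.pt
matrix[gt_class, detection_class] += 1   # IndexError: index 8 is out of bounds for axis 1 with size 2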

mAP for custom trained yolov5x

# yolov5x
!python test.py --weights runs/train/exp3/weights/best.pt --data localDataset/localDataset.yaml --img 640

Namespace(augment=False, batch_size=32, conf_thres=0.001, data='localDataset/localDataset.yaml', device='', exist_ok=False, img_size=640, iou_thres=0.6, name='exp', project='runs/test', save_conf=False, save_hybrid=False, save_json=False, save_txt=False, single_cls=False, task='val', verbose=False, weights=['runs/train/exp3/weights/best.pt'])
YOLOv5 v4.0-20-ge8a41e8 torch 1.7.0+cu101 CUDA:0 (Tesla T4, 15109.75MB)

Fusing layers... 
Model Summary: 476 layers, 87198694 parameters, 0 gradients, 217.1 GFLOPS
val: Scanning 'localDataset/labels.cache' for images and labels... 100 found, 0 missing, 0 empty, 0 corrupted: 100% 100/100 [00:00<00:00, 1061849.11it/s]
               Class      Images     Targets           P           R      mAP@.5  mAP@.5:.95: 100% 4/4 [00:04<00:00,  1.03s/it]
                 all         100         304       0.731       0.947       0.897       0.463
Speed: 19.7/2.4/22.0 ms inference/NMS/total per 640x640 image at batch-size 32
Results saved to runs/test/exp2

@glenn-jocher
Member

@khalidw you can only test models on datasets with identical classes. You test a COCO trained model on COCO only.
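(One workaround sometimes used for this situation, not a recommendation made in this thread: keep the full COCO class list in the data YAML, e.g. by copying nc and names from the repo's data/coco.yaml, and relabel the single-class ground truth with COCO's boat index, 8, so the pretrained head and the labels agree on class indices. A hypothetical relabelling sketch for YOLO-format label files:)

# relabel_boats.py -- hypothetical helper, not part of the repo
# Rewrites YOLO-format label files so class id 0 (the custom 'boat' class)
# becomes class id 8 ('boat' in the 80-class COCO ordering).
from pathlib import Path

label_dir = Path('localDataset/labels')  # assumed location of the .txt label files

for txt in label_dir.rglob('*.txt'):
    lines = []
    for line in txt.read_text().splitlines():
        parts = line.split()
        if parts and parts[0] == '0':    # original single-class index
            parts[0] = '8'               # COCO 'boat' index
        lines.append(' '.join(parts))
    txt.write_text('\n'.join(lines) + '\n')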

@Akhp888

Akhp888 commented Feb 22, 2021

[image: confusion_matrix]

Why doesn't the sum add up to 1 for the predicted scores in the confusion matrix?

@glenn-jocher
Member

@Akhp888 the columns are normalized in the confusion matrix, not the rows.

There is also a PR #2114 open with a slightly different confusion matrix implementation that you may want to look at.
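(To illustrate what column normalization means here, a generic numpy sketch, not the repo's plotting code; the exact axis convention in the repo may differ:)

import numpy as np

# counts[i, j] = number of detections of class j whose matched ground truth is class i
counts = np.array([[50.0, 2.0],
                   [5.0, 43.0]])

# Divide each column by its sum: each cell becomes the fraction of class-j
# detections whose true class is i, so columns sum to 1 while rows generally do not.
col_norm = counts / counts.sum(axis=0, keepdims=True)
print(col_norm.sum(axis=0))  # [1. 1.]
print(col_norm.sum(axis=1))  # generally not [1. 1.]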

@Akhp888

Akhp888 commented Feb 23, 2021

> @Akhp888 the columns are normalized in the confusion matrix, not the rows.
>
> There is also a PR #2114 open with a slightly different confusion matrix implementation that you may want to look at.

Thanks for the reply @glenn-jocher
Taking into consideration the normalization you mentioned, I still find it hard to interpret the background FN/FP entries; can you help me make sense of those scores?
Also, I tried PR #2114, where I see the axis of the miscalculation is reversed, but it's still not something I can relate to.

Thanks

@glenn-jocher
Member

@Akhp888 yeah, don't worry, the confusion matrix is certainly pretty confusing. In general everyone is used to seeing classification confusion matrices, which are simpler due to the lack of the background class we have here.

I'm not sure about row and column normalization both at the same time. Is that even possible? i.e. is there a linear closed-form solution for that?

You can try testing at a few different confidence levels to understand how the confusion matrix works (e.g. 0.001, 0.1, 0.9).

@Akhp888

Akhp888 commented Feb 25, 2021

> @Akhp888 yeah, don't worry, the confusion matrix is certainly pretty confusing. In general everyone is used to seeing classification confusion matrices, which are simpler due to the lack of the background class we have here.
>
> I'm not sure about row and column normalization both at the same time. Is that even possible? i.e. is there a linear closed-form solution for that?
>
> You can try testing at a few different confidence levels to understand how the confusion matrix works (e.g. 0.001, 0.1, 0.9).

I was trying to figure out a way to apply the same normalization to the ground truth and predicted values when I noticed that the built-in xyxy2xywh and xywh2xyxy give me different values than I expected.
e.g. for xyxy = [5994.9658203125, 1397.2547607421875, 6290.80908203125, 1770.0487060546875]
with gn = [32579, 2048, 32579, 2048]
I got xywh = [0.1885536015033722, 0.7732674479484558, 0.009080796502530575, 0.18202829360961914]

whereas by manual calculation I expected to get [0.1840131931708309, 0.6822533011436462, 0.009080796271179288, 0.18202829360961914].
Notice the difference in the y value?
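(For reference: xywh in the YOLO label convention is center-based (x_center, y_center, width, height, normalized by image size), so the converted x and y are the box center, not the top-left corner. Plain arithmetic with the values quoted above, not the repo's utility:)

xyxy = [5994.9658203125, 1397.2547607421875, 6290.80908203125, 1770.0487060546875]
gn = [32579, 2048, 32579, 2048]  # image width, height, width, height

x1, y1, x2, y2 = xyxy
w_img, h_img = gn[0], gn[1]

x_c = (x1 + x2) / 2 / w_img  # ~0.18855, matches the xyxy2xywh output above
y_c = (y1 + y2) / 2 / h_img  # ~0.77327, the box center; y1 / h_img would give ~0.68225
w = (x2 - x1) / w_img        # ~0.00908
h = (y2 - y1) / h_img        # ~0.18203
print(x_c, y_c, w, h)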

@github-actions
Contributor

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@edervishaj

> @khalidw you can only test models on datasets with identical classes. You test a COCO trained model on COCO only.

So does this mean that yolov5 cannot be used for unseen data from another totally different dataset?

@glenn-jocher
Member

@edervishaj YOLO models can be trained on any detection dataset and applied to any images. It's your responsibility to ensure that your deployment image space and your training image space are sufficiently similar for the model to generalize well.

[image: Screenshot 2021-04-26 at 23 08 52]

@edervishaj

Thank you for your reply.

> YOLO models can be trained on any detection dataset and applied to any images

Just like OP, I tried running test.py using pretrained weights on a dataset with intersecting classes but ran into an error similar to the one shown by @khalidw.

@glenn-jocher
Member

glenn-jocher commented Apr 26, 2021

@edervishaj I don't understand what you're asking. For directions on creating a proper dataset with train, val, and test subsets for training and testing with this repo, please see the Train Custom Data tutorial to get started.

YOLOv5 Tutorials

@abuelgasimsaadeldin

abuelgasimsaadeldin commented Jul 5, 2021

> @edervishaj I don't understand what you're asking. For directions on creating a proper dataset with train, val, and test subsets for training and testing with this repo, please see the Train Custom Data tutorial to get started.
>
> YOLOv5 Tutorials

@glenn-jocher I think what they mean is that they have a set of test images (say, containing two classes: car and truck), both of which are included in the COCO dataset. Now they want to run test.py, but instead of using a custom model trained on both of these classes, they want to get the mAP on the test images using the default yolov5s.pt, which was trained on all 80 classes.

@Ankit-Vohra

@glenn-jocher I have created a custom model for 2 classes (Good and Defective). I have split my dataset into 3 parts: train, val and test.
When I use
python /content/yolov5/test.py --weights /content/yolov5/runs/train/exp3/weights/best.pt --data coco128.yaml --img 640 --augment --half --conf-thres 0.5 --device 0
and my coco128.yaml is


# Train/val/test sets as 1) dir: path/to/imgs, 2) file: path/to/imgs.txt, or 3) list: [path/to/imgs1, path/to/imgs2, ..]
path: /content/drive/MyDrive/Spacer_crop_copy  # dataset root dir
train: images/train  # train images (relative to 'path') 128 images
val: images/val  # val images (relative to 'path') 128 images
test: /content/drive/MyDrive/test # test images (optional)

# Classes
nc: 2  # number of classes
names: [ 'Defective', 'Good' ]  # class names

Now the script is only taking the val folder into consideration, but I want detailed metrics for the test folder. Please help.

@khalidw
Author

khalidw commented Jul 13, 2021

@Ankit-Vohra Simply point the val entry at your test data (i.e. swap test for val in the YAML) to generate metrics for the test data, as sketched below.
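(A sketch of that edit, reusing the YAML already posted above; only the val line changes:)

# coco128.yaml -- temporary edit so the test images are evaluated
path: /content/drive/MyDrive/Spacer_crop_copy  # dataset root dir
train: images/train  # train images (relative to 'path')
val: /content/drive/MyDrive/test  # point val at the test images
nc: 2  # number of classes
names: [ 'Defective', 'Good' ]  # class names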

@glenn-jocher
Member

glenn-jocher commented Jul 13, 2021

@Ankit-Vohra you can use the --task argument with test.py to point it to the split you are interested in evaluating:

yolov5/test.py

Line 315 in d204a61

parser.add_argument('--task', default='val', help='train, val, test, speed or study')

i.e.

python test.py --task test
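(For example, combining --task with the weights and data flags already used earlier in this thread, an invocation could look like:)

python test.py --weights /content/yolov5/runs/train/exp3/weights/best.pt --data coco128.yaml --img 640 --task test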

@supratim1121992

The test.py file seems to have been removed from the repository. Is it alright to use val.py with --task test instead to obtain evaluation metrics on the test set?

@glenn-jocher
Member

Yes

@Latzi

Latzi commented Jul 15, 2023

> Yes

val.py doesn't seem to accept --task test?

@glenn-jocher
Member

@Latzi yes, you are correct. The val.py script does not have a --task test argument. However, you can still use val.py to evaluate your model on the test set by running the script without any additional arguments. The script will automatically detect the test set based on your dataset configuration file (--data argument) and generate evaluation metrics for the test set.

@Latzi

Latzi commented Jul 15, 2023

> @Latzi yes, you are correct. The val.py script does not have a --task test argument. However, you can still use val.py to evaluate your model on the test set by running the script without any additional arguments. The script will automatically detect the test set based on your dataset configuration file (--data argument) and generate evaluation metrics for the test set.

I did run it by adding --task val. I assume the set evaluated was the dataset in the val folder? In my YAML file I have specified train, val and test. So, if I understand correctly, if I simply run !python val.py --weights runs/train/exp/weights/best.pt --data Person.yaml, the evaluated set will be the data specified in the YAML under the test entry? Is that correct?

@glenn-jocher
Member

@Latzi yes, that is correct. If you run val.py without any additional arguments, it will automatically detect the test set based on your dataset configuration file (--data argument) and generate evaluation metrics for that test set. In your case, since you have specified train, val, and test subsets in your YAML file, running the command python val.py --weights runs/train/exp/weights/best.pt --data Person.yaml will evaluate the model on the test set specified in your YAML file.

@Latzi

Latzi commented Jul 15, 2023

> @Latzi yes, that is correct. If you run val.py without any additional arguments, it will automatically detect the test set based on your dataset configuration file (--data argument) and generate evaluation metrics for that test set. In your case, since you have specified train, val, and test subsets in your YAML file, running the command python val.py --weights runs/train/exp/weights/best.pt --data Person.yaml will evaluate the model on the test set specified in your YAML file.

Hi @glenn-jocher . Thanks for your answer. It is all clear now :-)

@glenn-jocher
Member

@Latzi you're welcome! I'm glad I could help clarify things for you. If you have any further questions or need any more assistance, feel free to ask. Good luck with your project!

@Latzi

Latzi commented Jul 16, 2023

> @Latzi you're welcome! I'm glad I could help clarify things for you. If you have any further questions or need any more assistance, feel free to ask. Good luck with your project!

Hi @glenn-jocher. Something's up; it doesn't make sense. I ran val.py without --task test as we discussed, and it ran fine, but when I look at the output I got results on 864 images with 82 instances, which are exactly the number of images and annotated class instances I have in the val folder. The test folder has 975 images with 102 instances. So by the looks of it, val.py ran on the images in the val folder, not the test folder. Running !python val.py --weights runs/train/exp/weights/best.pt --data Person.yaml returns results from the val folder. My YAML file looks like this:

train: ../train_data_0123_4/images/train/  # train images
val: ../train_data_0123_4/images/val/  # val images
test: ../train_data_0123_4/test/  # test images

# Classes
nc: 1  # number of classes
names: ['Person']

Please help me out with this one

@Latzi

Latzi commented Jul 16, 2023

@glenn-jocher If I swap val with test in the YAML file, then val.py evaluates the test folder, as I can now see the correct number of images and instances. Also, the results obtained by running val.py on the test folder are significantly higher than when running val.py on the val folder. Really big difference?

Like for the test folder I get val: Scanning /content/train_data_4012_3/test/labels.cache... 975 images, 873 backgrounds, 0 corrupt: 100% 975/975 [00:00<?, ?it/s]
Class Images Instances P R mAP50 mAP50-95: 100% 31/31 [00:07<00:00, 4.00it/s]
all 975 107 0.999 0.991 0.995 0.963
Speed: 0.1ms pre-process, 2.3ms inference, 0.7ms NMS per image at shape (32, 3, 640, 640)
Results saved to runs/val/exp3

while for the val folder the results are :
val: Scanning /content/train_data_4012_3/labels/val.cache... 864 images, 782 backgrounds, 0 corrupt: 100% 864/864 [00:00<?, ?it/s]
Class Images Instances P R mAP50 mAP50-95: 100% 27/27 [00:06<00:00, 4.35it/s]
all 864 82 0.823 0.415 0.509 0.298
Speed: 0.1ms pre-process, 2.2ms inference, 0.8ms NMS per image at shape (32, 3, 640, 640)

Is that normal? Or am I making some fundamental mistake? The files in the test folder were never seen by the model during training. (I am running a 5-fold cross validation experiment, and this 5th fold is used for the test folder.)

@glenn-jocher
Member

@Latzi hi there! It seems you're experiencing a difference in the evaluation metrics when running val.py on the "val" and "test" folders using your YAML file. Just to clarify, the "test" folder contains images that were not seen by the model during training.

One reason for the discrepancy in results is that the model has not been exposed to the test images during training, so its behaviour on this specific unseen data can differ from what you see on the validation set in terms of precision, recall, and mAP.

Furthermore, please note that evaluation metrics can vary with the distribution and complexity of the data in the respective folders. It's possible that the test data contains different instances or scenarios compared to the validation data, resulting in different performance metrics.

It's essential to evaluate the model on unseen data to gauge its performance on new and unseen samples. The results obtained from evaluating the "test" folder provide a better indication of how the model will perform in real-world scenarios.

If you have any further questions or need additional assistance, please feel free to ask.

@Latzi

Latzi commented Jul 16, 2023

@glenn-jocher. The dataset has been split into 5 equal parts, totally random. One of these parts is called test; it has been loaded into the Colab environment, and the YAML file points to this folder. The test folder has images and labels folders inside. It appears that when I test the model I get higher performance metrics (which I am not complaining about :-) ), except that I still don't understand this huge difference. The val folder I ran val.py on contains the images that were used as the validation set during training, yet I am getting much lower mAP values, as you can see above. I don't quite understand why that is. Also, the only way to get val.py to analyze the files in the test folder (unseen during training) is to swap the val and test paths, which makes val.py look inside the test folder instead.

I just want to double, triple, quadruple check that the files in the test folder are not used for training or validation purposes, right? Even if they are inside the train_data folder? The training script only uses the train and val folders, right?

@glenn-jocher
Member

@Latzi the test folder should indeed contain images and labels that were not used during training or validation. It's important for evaluation purposes to have a separate set of unseen data to assess the model's performance on new samples. Swapping the paths in the YAML file (putting the test path in the val field) allows val.py to analyze the files in the test folder.

Regarding the difference in performance metrics between the val and test folders, this can be influenced by various factors. The test data might contain different instances or scenarios compared to the validation data, leading to varying results. The model has also never seen the specific patterns or variations present in the test set, so its metrics there can differ from the validation metrics in either direction.

To further investigate, you can examine the specific instances where the model is struggling in the test set. Analyzing false positives and false negatives can provide insights into potential areas for improvement.

Rest assured that the training script only uses the train and val folders for training and validation purposes, respectively. The test folder remains completely unseen by the training process, which aligns with the standard evaluation setup.

If you have any additional questions or concerns, please let me know.

@Latzi

Latzi commented Jul 16, 2023

@glenn-jocher. Thank you very much for your answers. The test set performs fantastically, while val.py run on the validation set gives much lower results. Really weird. Also, in val.py I am assuming that a series of evaluations is performed on the images, then the detections are compared to the annotations, and the numbers are compiled at the end to form the values that are the output of val.py. The only question is: during these evaluations, what confidence level is the model set to? Whatever the F1 value was during training? It is surely not arbitrary. That is the last question, I promise :-). I really appreciate your time and help, it is awesome!

@glenn-jocher
Member

@Latzi the test set performing better than the validation set is indeed an interesting observation. There can be various reasons for such a difference, for example differences in instance difficulty, class balance, or scene composition between the two splits, and the model has of course never been exposed to the test images during training, so its behaviour on them can differ in either direction.

Regarding your question about the confidence level during evaluation in val.py: the confidence threshold used for the model's detections is not based on the F1 value during training. By default, val.py uses a confidence threshold of 0.001, which can be adjusted with the --conf-thres argument. This threshold determines the minimum confidence a detection needs in order to be considered during evaluation.

I'm glad I could be of help, and I appreciate your kind words. If you have any more questions or need further assistance, please feel free to ask.

@Latzi

Latzi commented Jul 16, 2023

Hi @glenn-jocher. Yes, I saw the conf value being 0.001 in the val.py code. But that confuses me further, as it seems super low. Wouldn't that mean the model will see objects of interest where there are no such objects, increasing the false positives through the roof? I mean, during training I had an F1 maxing out at, let's say, 0.45. So in my mind, if I set the conf factor to 0.45, I should get a realistic picture of what the model's performance would be. If I ran detect.py with a confidence of 0.001, I'd get an avalanche of detections, most of which would be false positives.

@glenn-jocher
Member

@Latzi the confidence threshold used during evaluation in val.py is a value that determines the minimum confidence level required for a detection to be considered valid. A lower threshold, such as 0.001, means that even detections with very low confidence scores will be considered valid.

Setting the confidence threshold too low can indeed result in an increase in false positives, as the model may detect objects where there are none or where the confidence score is very low. It's important to strike a balance when choosing the confidence threshold based on your specific requirements and the desired trade-off between false positives and false negatives.

If you set the confidence factor to 0.45, as the F1 score suggests, it would be a more realistic threshold to consider the model's performance. This would filter out detections with confidence scores lower than 0.45 and provide a clearer picture of the model's accuracy.

I hope this clarifies the role of the confidence threshold in evaluating the model's performance. If you have any further questions, please feel free to ask.
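(Concretely, a side-by-side check along the lines discussed above might look like this; the weights and data paths reuse the ones already quoted in this thread, and --conf-thres is the flag name used by the script:)

# default evaluation threshold (0.001)
python val.py --weights runs/train/exp/weights/best.pt --data Person.yaml

# stricter threshold near the F1-optimal confidence mentioned above
python val.py --weights runs/train/exp/weights/best.pt --data Person.yaml --conf-thres 0.45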

@Latzi

Latzi commented Jul 16, 2023

@glenn-jocher That is what I was saying: setting the confidence threshold to a more realistic value, closer to what the F1 curve suggests, is a better way of testing performance than leaving it at the default 0.001. I am doing a 5-fold cross validation exercise, and if I run val.py at 0.001 everything turns into an amorphous mass. Running val.py without setting a realistic conf value is where the confusion came from. In the Colab file the val.py line doesn't have a confidence parameter, and I wrongly assumed the confidence would be whatever the F1 curve suggested, as that would give a realistic result. That said, if that were the case, then running val.py with previously trained models (if the session is still active) would create another issue, because for that particular model the F1-optimal confidence might be very different. So for this 5-fold validation exercise I will probably work out the average F1-optimal confidence across the 5 models, run val.py for each model on the test set (data not seen during training), and then compare the results and work out the P, R and mAPs. The data I have is homogeneous (the same chunk, randomized) and was divided into 5 equal parts, with 4 parts used for training and validation and the 5th for testing. Given the homogeneous nature of the data I expect similar model performances. But it is all good now. Thank you for clarifying, and for your time and patience.

@Latzi

Latzi commented Jul 16, 2023

I mean, I kept referring to the F1 score, which is not really the right way to determine the best confidence level, but for unbalanced class datasets (which is my case) I have found in past projects that a confidence factor near the value where the F1 score maxes out is always a good starting point. In any case, the n-fold validation results should be compared at the same confidence threshold to get meaningful results. All is clear, thank you again :-).

@glenn-jocher
Member

@Latzi hi there,

Setting a confidence threshold that aligns with the F1 score is indeed a good starting point, especially for unbalanced class datasets. It can provide a more realistic assessment of the model's performance. In your case, running the validation script with a confidence threshold closer to the F1 score max can help generate meaningful results.

Comparing n-fold validation results using the same confidence threshold is a valid approach to evaluate and compare the performance of different models. It ensures consistency in the evaluation process and allows for a fair comparison.

I'm glad I could help clarify the confusion, and if you have any further questions or need assistance, feel free to ask.

Thank you for reaching out and have a great day!
