
Add progress and error reporting API #650

Merged: 45 commits merged into develop, Feb 18, 2022

Conversation

zhiltsov-max (Contributor) commented Feb 3, 2022

Summary

Closes #142

  • Added progress and error reporting in Dataset.import_from and Dataset.export (backward compatible)
  • Added the ability to skip or fail on import and export problems with specific images or annotations
  • Updated the COCO, VOC, and YOLO formats
  • Improved error messages in formats
  • The Extractor constructor now accepts only keyword arguments. Subclasses must define their own constructor, which accepts one positional argument and any keyword arguments; unused keyword arguments must be passed to the base class.
  • Added Scope.add_many() to enter multiple context managers in one call.
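A minimal sketch of the new constructor convention described above. The names here are illustrative, not the exact Datumaro signatures: the base class takes only keyword arguments, and a subclass takes one positional argument and forwards the keyword arguments it does not consume to the base class.

```python
class Extractor:
    # Base class: keyword-only arguments (illustrative parameter names)
    def __init__(self, *, length=None, subset=None, ctx=None):
        self._length = length
        self._subset = subset
        self._ctx = ctx

class MyFormatExtractor(Extractor):
    # Subclass: one positional argument, any keyword arguments;
    # unconsumed kwargs are passed up to the base class
    def __init__(self, path, **kwargs):
        super().__init__(**kwargs)
        self._path = path

e = MyFormatExtractor("dataset/annotations.json", subset="train")
print(e._subset)  # train
```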

How to test

Checklist

License

  • I submit my code changes under the same MIT License that covers the project.
    Feel free to contact the maintainers if that's a concern.
  • I have updated the license header for each file (see an example below)

        # Copyright (C) 2021 Intel Corporation
        #
        # SPDX-License-Identifier: MIT

@zhiltsov-max zhiltsov-max force-pushed the zm/progress-and-error-reporting branch from d3da45b to cf60478 Compare February 8, 2022 11:25
@zhiltsov-max zhiltsov-max force-pushed the zm/progress-and-error-reporting branch from cf60478 to 64c0e88 Compare February 8, 2022 13:16
@zhiltsov-max zhiltsov-max changed the title [WIP] Add basic progress and error reporting Add progress and error reporting API Feb 8, 2022
IRDonch left a comment:

Very limited review so far - I will come back to this.

(Resolved review threads, now outdated, on datumaro/components/progress_reporting.py, datumaro/components/extractor.py, and datumaro/components/dataset.py.)
            parsed_annotations=items[img_id].annotations)
    except Exception as e:
        self._report_annotation_error(e,
            item_id=(img_id, self._subset))

IRDonch commented:

I don't think this passes the type check, because img_id might not be a string at this point.

zhiltsov-max (Contributor, Author) commented Feb 11, 2022:

Actually, we might not even have an image id at this point, but it would be better to include anything useful for troubleshooting. I need to think more about how to report it.

"""
raise NotImplementedError

def start(self, total: int, *, desc: Optional[str] = None):
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO, the way that multiple progress bars are handled needs to be reworked. The current approach has two problems:

  1. There is no way for the client to gauge how much progress has been made in total, because the caller doesn't know how many times the service (by which I mean the component reporting the progress, such as an extractor) will call start. This will be a problem if Datumaro is integrated into a bigger pipeline which only allows reporting progress as a single fraction, such as Accuracy Checker.

  2. It's not possible to have multiple concurrently-updating progress bars, because the report_status and finish functions don't have enough information to determine which bar is being updated.

I propose the following redesign that fixes these problems:

  • Remove the desc argument from start and iter.

  • Add the following method:

    def split(self, subtasks: Sequence[str]) -> Tuple[ProgressReporter, ...]: ...

    This method tells the reporter that the current task consists of several subtasks (each element of subtasks is the description of a subtask) and returns one progress reporter for each subtask.

  • Stipulate that the service must call one and only one of the following methods on its reporter:

    • start - if it can subdivide its work into N homogeneous units. If it calls start, then it must use report_status to report the completion of these units.
    • split - if it can subdivide its work into heterogeneous units (subtasks). If it calls split, then no more methods should be called on the original reporter. Instead, the returned reporters must be used, following these rules recursively.
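A hedged sketch of the proposed interface. Only the split signature above comes from the proposal; all other method bodies and field names here are illustrative assumptions, not the merged Datumaro code.

```python
from typing import Optional, Sequence, Tuple

class ProgressReporter:
    def __init__(self, desc: str = ""):
        self.desc = desc
        self.total: Optional[int] = None
        self.done = 0
        self.children: Tuple["ProgressReporter", ...] = ()

    def start(self, total: int):
        # Called when the work divides into N homogeneous units
        self.total = total

    def report_status(self, done: int):
        self.done = done

    def finish(self):
        if self.total is not None:
            self.done = self.total

    def split(self, subtasks: Sequence[str]) -> Tuple["ProgressReporter", ...]:
        # Called when the work divides into named heterogeneous subtasks;
        # the original reporter must not be used after this call
        self.children = tuple(ProgressReporter(desc) for desc in subtasks)
        return self.children

root = ProgressReporter()
items, anns = root.split(["items", "annotations"])
items.start(10)
items.report_status(5)
print([c.desc for c in root.children])  # ['items', 'annotations']
```

A service following the rules calls exactly one of start or split on each reporter it receives, which lets the client reconstruct total progress recursively.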

zhiltsov-max (Contributor, Author) replied:

Yeah, it makes sense. It can still be hard to solve the first problem with the proposed design, though. The second problem looks mostly theoretical at this point.

IRDonch replied:

> It still can be hard to solve the first problem with the proposed design.

What do you mean? It seems fairly easy to calculate:

  • If the service has called start, then the progress is the current number of completed units divided by the total number of units (if the service didn't specify the total number, then the progress can only be approximated as 0 before the call to finish and 1 after).
  • If the service has called split, then the progress is the sum of the progresses of the child reporters, divided by the number of child reporters.
  • If the service has called neither, the progress is 0.
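The three aggregation rules above can be sketched as a recursive function. The reporter record below is a minimal stand-in with illustrative field names, not the real Datumaro type.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Reporter:
    total: Optional[int] = None   # set when start() was called with a total
    done: int = 0
    finished: bool = False
    children: List["Reporter"] = field(default_factory=list)  # set by split()

def progress(r: Reporter) -> float:
    if r.children:
        # split() was called: average the children's progress
        return sum(progress(c) for c in r.children) / len(r.children)
    if r.total:
        # start() was called with a known total
        return r.done / r.total
    # start() without a total, or neither call: 1 after finish, else 0
    return 1.0 if r.finished else 0.0

root = Reporter(children=[
    Reporter(total=10, done=5),   # halfway through the first subtask
    Reporter(total=4, done=4),    # second subtask complete
])
print(progress(root))  # 0.75
```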

zhiltsov-max (Contributor, Author) replied:

Yes, that is basically correct, but the problem is that if we only have a single pbar, the child pbars must do one of the following:

  • Commit to the single pbar. Depending on whether we know the total or not, mixed variants are possible. For instance, after starting a new pbar, we can get a new total and resize the pbar, or we can overfill the previous pbar (if we knew the total before, but now we don't). If we know all the totals, we can sum them, of course, but that doesn't mean the units are equal.
  • Split the pbar into segments and fill them independently
  • Rewrite the pbar

I'm not considering parallel processing here, but it can be a problem in theory. Generally, I like the idea; I'm just saying it's not a silver bullet.

zhiltsov-max (Contributor, Author) replied:

I've implemented this idea.

IRDonch replied:

Okay, but what about my proposal to move the description argument to split? I think that makes more sense, because if the service doesn't call split, then it doesn't need to supply a description - the client already knows what it's going to be doing. For example, if the client calls a converter, then the client can always label the progress bar "Exporting dataset", because that's what converters do.

On the other hand, descriptions are necessary when you're splitting the task into subtasks, since the client doesn't know what each subtask represents.

zhiltsov-max (Contributor, Author) replied:

It's probably reasonable to always split the input pbar, so the caller's title would not be used at all. I don't have strong arguments for or against requiring titles in split, but it looks more flexible to just ask for the number.

    error_policy: ImportErrorPolicy = field(default=None,
        converter=attr.converters.default_if_none(factory=FailingImportErrorPolicy))

    class NullImportContext(ImportContext):

IRDonch commented:

This class seems useless. It's only used once, and in that instance it could just be replaced by ImportContext.

zhiltsov-max (Contributor, Author) replied:

From my point of view, it clearly designates the caller's intention, while providing defaults in the base class is more of a forward-compatibility measure.

IRDonch replied:

I don't really understand the nuance here. Is there any circumstance where ImportContext() would act any differently from NullImportContext()? You're just giving the user a pointless choice.

In addition, this enables users to write nonsense like NullImportContext(error_policy=...), which seems undesirable.

zhiltsov-max (Contributor, Author) replied:

> Is there any circumstance where ImportContext() would act any differently from NullImportContext()? You're just giving the user a pointless choice.

The way I see it, ImportContext is just a base class. It doesn't have to be usable, though it can provide convenient defaults.

> In addition, this enables users to write nonsense like NullImportContext(error_policy=...), which seems undesirable.

Why would they need to do this?

    extractors.append(env.make_extractor(
        src_conf.format, src_conf.url, **src_conf.options
    ))
    # TODO: probably, should not be available in lazy mode, because it

IRDonch commented:

Why is this a TODO? It seems to already be done.

zhiltsov-max (Contributor, Author) replied:

The idea is that laziness is not an input parameter in the current implementation, and the extractors are not aligned on this. Ideally, it should be a function parameter, and it should also be aligned with the eager_mode context manager.

        raise NotImplementedError("Must be defined in the subclass")

    def fail(self, error: Exception) -> NoReturn:
        raise _ImportFail from error

IRDonch commented:

Why is _ImportFail necessary? What if this function just raised error?

zhiltsov-max (Contributor, Author) replied:

It is used to indicate an exit from the importing process. Other exceptions can be captured by other methods of this class.

IRDonch replied:

> Other exceptions can be captured by other methods of this class.

Isn't that a good thing, though? Imagine that you want to ignore items that failed to be imported for any reason (including bad annotations). You could write:

    def _handle_item_error(self, error: ItemImportError):
        pass

    def _handle_annotation_error(self, error: AnnotationImportError):
        ???

But if _handle_annotation_error calls fail, then the whole import fails; while if it does nothing, the annotation is ignored, and the item is still used. It seems like it would be better if _handle_item_error got a shot at handling the error raised by _handle_annotation_error.

zhiltsov-max (Contributor, Author) replied:

This is a good question. The behavior is going to be quite inconsistent, because different formats can follow any of these patterns:

    [try] load item()
        [try] load annotations()

    [try] load item()
    [try] load annotations()

    [try] load item and annotations()

If we have, e.g., handle_ann -> fail and handle_item -> skip, then in the second case we'll fail, while in the first and the third we'll skip. If we don't capture errors, any of these outcomes is possible too. If we assume the format is "right" and allows this, then yes. Probably, we can just throw the error, but then also allow capturing it in report_item_error. We still need to wrap errors in dedicated classes to indicate their kinds.
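A minimal sketch of the "wrap errors in dedicated classes" idea. The error class names match those mentioned in the thread, but the loader, policy, and parse function are hypothetical stand-ins, not Datumaro code.

```python
class ItemImportError(Exception):
    """Failure to load a dataset item."""

class AnnotationImportError(ItemImportError):
    """Failure to load a single annotation of an item."""

class SkipAnnotationsPolicy:
    """Ignores bad annotations, keeps the item; fails on item-level errors."""
    def report_annotation_error(self, error):
        pass  # skip the annotation, continue with the item
    def report_item_error(self, error):
        raise error

def parse_annotation(raw):
    if not isinstance(raw, dict) or "label" not in raw:
        raise ValueError("malformed annotation")
    return raw["label"]

def load_item(raw_item, policy):
    labels = []
    for raw_ann in raw_item.get("annotations", []):
        try:
            labels.append(parse_annotation(raw_ann))
        except Exception as e:
            # Wrap in a dedicated class so the policy can tell error kinds apart
            policy.report_annotation_error(AnnotationImportError(str(e)))
    return {"id": raw_item["id"], "annotations": labels}

item = load_item(
    {"id": "img1", "annotations": [{"label": "cat"}, "broken", {"label": "dog"}]},
    SkipAnnotationsPolicy(),
)
print(item["annotations"])  # ['cat', 'dog']
```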

    def _load_items(self, json_data):
        pbars = scope_add_many(*self._ctx.progress_reporter.split(2))

IRDonch commented:

It would be more readable to assign each bar to its own variable.
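The suggestion sketched outside the Datumaro context, with a stand-in for the split-into-two call: tuple unpacking names each bar at the call site instead of leaving an opaque indexed collection.

```python
def split(n):
    # Stand-in for progress_reporter.split(): returns n independent reporters
    return tuple({"id": i, "done": 0} for i in range(n))

# Instead of an opaque collection accessed by index...
pbars = split(2)
pbars[0]["done"] = 1

# ...unpack into named variables, which documents what each bar tracks:
item_pbar, ann_pbar = split(2)
item_pbar["done"] = 1
print(item_pbar["id"], ann_pbar["id"])  # 0 1
```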

    for subset in self._subsets.values():
        for item in subset:

    @scoped(arg_name='scope') # pylint: disable=no-value-for-parameter
    def __iter__(self, *, scope: Scope = None):

IRDonch commented Feb 15, 2022:

Unnecessary default argument?

IRDonch added:

Actually, there's a bigger problem here - a generator function shouldn't be scoped, since the closing then becomes nondeterministic.

zhiltsov-max (Contributor, Author) commented Feb 15, 2022:

It just suppresses the linter message, because __iter__ cannot have arguments. But we need an argument here (unless we go for code generation), because otherwise it is not captured by the generator (which appears because of the yield) (see #661).

> Actually, there's a bigger problem here - a generator function shouldn't be scoped, since then the closing becomes nondeterministic.

It is one of the reasons why error and progress reporting cannot be used in lazy mode.

IRDonch replied:

This seems to me like another reason why the progress reporter should not be a context manager.

zhiltsov-max (Contributor, Author) replied:

After experimenting and thinking it over, I decided to remove closing from the plugins, but restore finish. At the end of the day, it feels natural for a pbar (and possibly a UI) to manage its child objects. Finishing is still important for signaling the end of loading in a pbar. In case of an error, we aren't going to have any new pbars. There is still a question of whether we need a dedicated close (to close an unfinished pbar), but it is not that important now.

    def fail(self, error: Exception) -> NoReturn:
        raise _ExportFail from error

    class FailingExportErrorPolicy(ExportErrorPolicy):

IRDonch commented:

Perhaps the failing behavior should just be the default? That way, a custom policy that wants to ignore only a certain kind of error will only need to override one method.

It will also allow other kinds of errors to be added in the future without breaking compatibility.
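A hedged sketch of this suggestion. The class and method names are illustrative, and the sketch raises the error directly rather than wrapping it in _ExportFail: the base policy fails on every error kind by default, so a custom policy overrides only the handlers it wants to relax, and error kinds added later still fail safely.

```python
class ExportErrorPolicy:
    # Failing is the default behavior for every error kind
    def report_item_error(self, error: Exception) -> None:
        self.fail(error)

    def report_annotation_error(self, error: Exception) -> None:
        self.fail(error)

    def fail(self, error: Exception):
        raise error

class SkipAnnotationsPolicy(ExportErrorPolicy):
    # Relax only annotation errors; item errors still fail via the base class
    def report_annotation_error(self, error: Exception) -> None:
        pass

policy = SkipAnnotationsPolicy()
policy.report_annotation_error(ValueError("bad bbox"))  # ignored
try:
    policy.report_item_error(ValueError("bad item"))
except ValueError as e:
    print("failed:", e)  # failed: bad item
```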

zhiltsov-max (Contributor, Author) replied:

Yes, it makes sense. Done.

IRDonch replied:

I don't think you pushed it.

@zhiltsov-max zhiltsov-max force-pushed the zm/progress-and-error-reporting branch 2 times, most recently from 9d76d1e to b282d99 Compare February 16, 2022 19:19
@zhiltsov-max zhiltsov-max changed the title Add progress and error reporting API [WIP] Add progress and error reporting API Feb 18, 2022
@zhiltsov-max zhiltsov-max changed the title [WIP] Add progress and error reporting API Add progress and error reporting API Feb 18, 2022
@zhiltsov-max zhiltsov-max merged commit 6070d05 into develop Feb 18, 2022
zhiltsov-max (Contributor, Author) commented:

I'll merge it. There are definitely a lot of improvements that can be made, but they are not critical and can be done later.

Successfully merging this pull request may close these issues.

Provide a mode to ignore, fix, stop on or report errors in imported datasets