
Add progress and error reporting API #650

Merged: 45 commits merged into develop, Feb 18, 2022

Conversation

zhiltsov-max (Contributor) commented Feb 3, 2022

Summary

Closes #142

  • Added progress and error reporting in Dataset.import_from and Dataset.export (backward compatible)
  • Added the ability to skip or fail on import and export problems with specific images or annotations
  • Updated the COCO, VOC, and YOLO formats
  • Improved error messages in formats
  • The Extractor constructor now accepts only keyword arguments. Subclasses must define their own constructor, which accepts one positional argument and any keyword arguments; unused keyword arguments must be passed to the base class.
  • Added Scope.add_many() to enter multiple context managers in one call.
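A minimal sketch of the new constructor convention described above. The names here are illustrative, not the exact Datumaro signatures: the base class takes only keyword arguments, and a subclass takes one positional argument and forwards the keyword arguments it does not consume to the base class.

```python
class Extractor:
    # Base class: keyword-only arguments (illustrative parameter names)
    def __init__(self, *, length=None, subset=None, ctx=None):
        self._length = length
        self._subset = subset
        self._ctx = ctx

class MyFormatExtractor(Extractor):
    # Subclass: one positional argument, any keyword arguments;
    # unconsumed kwargs are passed up to the base class
    def __init__(self, path, **kwargs):
        super().__init__(**kwargs)
        self._path = path

e = MyFormatExtractor("dataset/annotations.json", subset="train")
print(e._subset)  # train
```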

How to test

Checklist

License

  • I submit my code changes under the same MIT License that covers the project.
    Feel free to contact the maintainers if that's a concern.
  • I have updated the license header for each file (see an example below)

        # Copyright (C) 2021 Intel Corporation
        #
        # SPDX-License-Identifier: MIT

@zhiltsov-max zhiltsov-max force-pushed the zm/progress-and-error-reporting branch from d3da45b to cf60478 Compare February 8, 2022 11:25
@zhiltsov-max zhiltsov-max force-pushed the zm/progress-and-error-reporting branch from cf60478 to 64c0e88 Compare February 8, 2022 13:16
@zhiltsov-max zhiltsov-max changed the title [WIP] Add basic progress and error reporting Add progress and error reporting API Feb 8, 2022
IRDonch left a comment:

Very limited review so far - I will come back to this.

(Resolved review threads, now outdated, on datumaro/components/progress_reporting.py, datumaro/components/extractor.py, and datumaro/components/dataset.py.)
            parsed_annotations=items[img_id].annotations)
    except Exception as e:
        self._report_annotation_error(e,
            item_id=(img_id, self._subset))

IRDonch commented:

I don't think this passes the type check, because img_id might not be a string at this point.

zhiltsov-max (Contributor, Author) commented Feb 11, 2022:

Actually, we might not even have an image id at this point, but it would be better to include anything useful for troubleshooting. I need to think more about how to report it.

"""
raise NotImplementedError

def start(self, total: int, *, desc: Optional[str] = None):
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO, the way that multiple progress bars are handled needs to be reworked. The current approach has two problems:

  1. There is no way for the client to gauge how much progress has been made in total, because the caller doesn't know how many times the service (by which I mean the component reporting the progress, such as an extractor) will call start. This will be a problem if Datumaro is integrated into a bigger pipeline which only allows reporting progress as a single fraction, such as Accuracy Checker.

  2. It's not possible to have multiple concurrently-updating progress bars, because the report_status and finish functions don't have enough information to determine which bar is being updated.

I propose the following redesign that fixes these problems:

  • Remove the desc argument from start and iter.

  • Add the following method:

    def split(self, subtasks: Sequence[str]) -> Tuple[ProgressReporter, ...]: ...

    This method tells the reporter that the current task consists of several subtasks (each element of subtasks is the description of a subtask) and returns one progress reporter for each subtask.

  • Stipulate that the service must call one and only one of the following methods on its reporter:

    • start - if it can subdivide its work into N homogeneous units. If it calls start, then it must use report_status to report the completion of these units.
    • split - if it can subdivide its work into heterogeneous units (subtasks). If it calls split, then no more methods should be called on the original reporter. Instead, the returned reporters must be used, following these rules recursively.
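A hedged sketch of the proposed interface. Only the split signature above comes from the proposal; all other method bodies and field names here are illustrative assumptions, not the merged Datumaro code.

```python
from typing import Optional, Sequence, Tuple

class ProgressReporter:
    def __init__(self, desc: str = ""):
        self.desc = desc
        self.total: Optional[int] = None
        self.done = 0
        self.children: Tuple["ProgressReporter", ...] = ()

    def start(self, total: int):
        # Called when the work divides into N homogeneous units
        self.total = total

    def report_status(self, done: int):
        self.done = done

    def finish(self):
        if self.total is not None:
            self.done = self.total

    def split(self, subtasks: Sequence[str]) -> Tuple["ProgressReporter", ...]:
        # Called when the work divides into named heterogeneous subtasks;
        # the original reporter must not be used after this call
        self.children = tuple(ProgressReporter(desc) for desc in subtasks)
        return self.children

root = ProgressReporter()
items, anns = root.split(["items", "annotations"])
items.start(10)
items.report_status(5)
print([c.desc for c in root.children])  # ['items', 'annotations']
```

A service following the rules calls exactly one of start or split on each reporter it receives, which lets the client reconstruct total progress recursively.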

zhiltsov-max (Contributor, Author) replied:

Yeah, it makes sense. It can still be hard to solve the first problem with the proposed design, though. The second problem looks mostly theoretical at this point.

IRDonch replied:

> It still can be hard to solve the first problem with the proposed design.

What do you mean? It seems fairly easy to calculate:

  • If the service has called start, then the progress is the current number of completed units divided by the total number of units (if the service didn't specify the total number, then the progress can only be approximated as 0 before the call to finish and 1 after).
  • If the service has called split, then the progress is the sum of the progresses of the child reporters, divided by the number of child reporters.
  • If the service has called neither, the progress is 0.
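The three aggregation rules above can be sketched as a recursive function. The reporter record below is a minimal stand-in with illustrative field names, not the real Datumaro type.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Reporter:
    total: Optional[int] = None   # set when start() was called with a total
    done: int = 0
    finished: bool = False
    children: List["Reporter"] = field(default_factory=list)  # set by split()

def progress(r: Reporter) -> float:
    if r.children:
        # split() was called: average the children's progress
        return sum(progress(c) for c in r.children) / len(r.children)
    if r.total:
        # start() was called with a known total
        return r.done / r.total
    # start() without a total, or neither call: 1 after finish, else 0
    return 1.0 if r.finished else 0.0

root = Reporter(children=[
    Reporter(total=10, done=5),   # halfway through the first subtask
    Reporter(total=4, done=4),    # second subtask complete
])
print(progress(root))  # 0.75
```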

zhiltsov-max (Contributor, Author) replied:

Yes, that is basically correct, but the problem is that if we only have a single pbar, the child pbars must do one of the following:

  • Commit to the single pbar. Depending on whether we know the total or not, mixed variants are possible. For instance, after starting a new pbar, we can get a new total and resize the pbar, or we can overfill the previous pbar (if we knew the total before, but now we don't). If we know all the totals, we can sum them, of course, but that doesn't mean the units are equal.
  • Split the pbar into segments and fill them independently
  • Rewrite the pbar

I'm not considering parallel processing here, but it can be a problem in theory. Generally, I like the idea; I'm just saying it's not a silver bullet.

zhiltsov-max (Contributor, Author) replied:

I've implemented this idea.

IRDonch replied:

Okay, but what about my proposal to move the description argument to split? I think that makes more sense, because if the service doesn't call split, then it doesn't need to supply a description - the client already knows what it's going to be doing. For example, if the client calls a converter, then the client can always label the progress bar "Exporting dataset", because that's what converters do.

On the other hand, descriptions are necessary when you're splitting the task into subtasks, since the client doesn't know what each subtask represents.

zhiltsov-max (Contributor, Author) replied:

It's probably reasonable to always split the input pbar, so the caller's title would not be used at all. I don't have strong arguments for or against requiring titles in split, but it looks more flexible to just ask for the number.

    error_policy: ImportErrorPolicy = field(default=None,
        converter=attr.converters.default_if_none(factory=FailingImportErrorPolicy))

    class NullImportContext(ImportContext):

IRDonch commented:

This class seems useless. It's only used once, and in that instance it could just be replaced by ImportContext.

zhiltsov-max (Contributor, Author) replied:

From my point of view, it clearly designates the caller's intention, while providing defaults in the base class is more of a forward-compatibility measure.

IRDonch replied:

I don't really understand the nuance here. Is there any circumstance where ImportContext() would act any differently from NullImportContext()? You're just giving the user a pointless choice.

In addition, this enables users to write nonsense like NullImportContext(error_policy=...), which seems undesirable.

zhiltsov-max (Contributor, Author) replied:

> Is there any circumstance where ImportContext() would act any differently from NullImportContext()? You're just giving the user a pointless choice.

The way I see it, ImportContext is just a base class. It doesn't have to be usable, though it can provide convenient defaults.

> In addition, this enables users to write nonsense like NullImportContext(error_policy=...), which seems undesirable.

Why would they need to do this?

    extractors.append(env.make_extractor(
        src_conf.format, src_conf.url, **src_conf.options
    ))
    # TODO: probably, should not be available in lazy mode, because it

IRDonch commented:

Why is this a TODO? It seems to already be done.

zhiltsov-max (Contributor, Author) replied:

The idea is that laziness is not an input parameter in the current implementation, and the extractors are not aligned on this. Ideally, it should be a function parameter, and it should also be aligned with the eager_mode context manager.

        raise NotImplementedError("Must be defined in the subclass")

    def fail(self, error: Exception) -> NoReturn:
        raise _ImportFail from error

IRDonch commented:

Why is _ImportFail necessary? What if this function just raised error?

zhiltsov-max (Contributor, Author) replied:

It is used to indicate an exit from the importing process. Other exceptions can be captured by other methods of this class.

IRDonch replied:

> Other exceptions can be captured by other methods of this class.

Isn't that a good thing, though? Imagine that you want to ignore items that failed to be imported for any reason (including bad annotations). You could write:

    def _handle_item_error(self, error: ItemImportError):
        pass

    def _handle_annotation_error(self, error: AnnotationImportError):
        ???

But if _handle_annotation_error calls fail, then the whole import fails; while if it does nothing, the annotation is ignored, and the item is still used. It seems like it would be better if _handle_item_error got a shot at handling the error raised by _handle_annotation_error.

zhiltsov-max (Contributor, Author) replied:

This is a good question. The behavior is going to be quite inconsistent, because different formats can follow any of these patterns:

    [try] load item()
        [try] load annotations()

    [try] load item()
    [try] load annotations()

    [try] load item and annotations()

If we have, e.g., handle_ann -> fail and handle_item -> skip, then in the second case we'll fail, while in the first and the third we'll skip. If we don't capture errors, any of these outcomes is possible too. If we assume the format is "right" and allows this, then yes. Probably, we can just throw the error, but then also allow capturing it in report_item_error. We still need to wrap errors in dedicated classes to indicate their kinds.
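A minimal sketch of the "wrap errors in dedicated classes" idea. The error class names match those mentioned in the thread, but the loader, policy, and parse function are hypothetical stand-ins, not Datumaro code.

```python
class ItemImportError(Exception):
    """Failure to load a dataset item."""

class AnnotationImportError(ItemImportError):
    """Failure to load a single annotation of an item."""

class SkipAnnotationsPolicy:
    """Ignores bad annotations, keeps the item; fails on item-level errors."""
    def report_annotation_error(self, error):
        pass  # skip the annotation, continue with the item
    def report_item_error(self, error):
        raise error

def parse_annotation(raw):
    if not isinstance(raw, dict) or "label" not in raw:
        raise ValueError("malformed annotation")
    return raw["label"]

def load_item(raw_item, policy):
    labels = []
    for raw_ann in raw_item.get("annotations", []):
        try:
            labels.append(parse_annotation(raw_ann))
        except Exception as e:
            # Wrap in a dedicated class so the policy can tell error kinds apart
            policy.report_annotation_error(AnnotationImportError(str(e)))
    return {"id": raw_item["id"], "annotations": labels}

item = load_item(
    {"id": "img1", "annotations": [{"label": "cat"}, "broken", {"label": "dog"}]},
    SkipAnnotationsPolicy(),
)
print(item["annotations"])  # ['cat', 'dog']
```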

    def _load_items(self, json_data):
        pbars = scope_add_many(*self._ctx.progress_reporter.split(2))

IRDonch commented:

It would be more readable to assign each bar to its own variable.
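The suggestion sketched outside the Datumaro context, with a stand-in for the split-into-two call: tuple unpacking names each bar at the call site instead of leaving an opaque indexed collection.

```python
def split(n):
    # Stand-in for progress_reporter.split(): returns n independent reporters
    return tuple({"id": i, "done": 0} for i in range(n))

# Instead of an opaque collection accessed by index...
pbars = split(2)
pbars[0]["done"] = 1

# ...unpack into named variables, which documents what each bar tracks:
item_pbar, ann_pbar = split(2)
item_pbar["done"] = 1
print(item_pbar["id"], ann_pbar["id"])  # 0 1
```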

    for subset in self._subsets.values():
        for item in subset:

    @scoped(arg_name='scope') # pylint: disable=no-value-for-parameter
    def __iter__(self, *, scope: Scope = None):

IRDonch commented Feb 15, 2022:

Unnecessary default argument?

IRDonch added:

Actually, there's a bigger problem here - a generator function shouldn't be scoped, since the closing then becomes nondeterministic.

zhiltsov-max (Contributor, Author) commented Feb 15, 2022:

It just suppresses the linter message, because __iter__ cannot have arguments. But we need an argument here (unless we go for code generation), because otherwise it is not captured by the generator (which appears because of the yield) (see #661).

> Actually, there's a bigger problem here - a generator function shouldn't be scoped, since then the closing becomes nondeterministic.

It is one of the reasons why error and progress reporting cannot be used in lazy mode.

IRDonch replied:

This seems to me like another reason why the progress reporter should not be a context manager.

zhiltsov-max (Contributor, Author) replied:

After experimenting and thinking it over, I decided to remove closing from the plugins, but restore finish. At the end of the day, it feels natural for a pbar (and possibly a UI) to manage its child objects. Finishing is still important for signaling the end of loading in a pbar. In case of an error, we aren't going to have any new pbars. There is still a question of whether we need a dedicated close (to close an unfinished pbar), but it is not that important now.

    def fail(self, error: Exception) -> NoReturn:
        raise _ExportFail from error

    class FailingExportErrorPolicy(ExportErrorPolicy):

IRDonch commented:

Perhaps the failing behavior should just be the default? That way, a custom policy that wants to ignore only a certain kind of error will only need to override one method.

It will also allow other kinds of errors to be added in the future without breaking compatibility.
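A hedged sketch of this suggestion. The class and method names are illustrative, and the sketch raises the error directly rather than wrapping it in _ExportFail: the base policy fails on every error kind by default, so a custom policy overrides only the handlers it wants to relax, and error kinds added later still fail safely.

```python
class ExportErrorPolicy:
    # Failing is the default behavior for every error kind
    def report_item_error(self, error: Exception) -> None:
        self.fail(error)

    def report_annotation_error(self, error: Exception) -> None:
        self.fail(error)

    def fail(self, error: Exception):
        raise error

class SkipAnnotationsPolicy(ExportErrorPolicy):
    # Relax only annotation errors; item errors still fail via the base class
    def report_annotation_error(self, error: Exception) -> None:
        pass

policy = SkipAnnotationsPolicy()
policy.report_annotation_error(ValueError("bad bbox"))  # ignored
try:
    policy.report_item_error(ValueError("bad item"))
except ValueError as e:
    print("failed:", e)  # failed: bad item
```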

zhiltsov-max (Contributor, Author) replied:

Yes, it makes sense. Done.

IRDonch replied:

I don't think you pushed it.

@zhiltsov-max zhiltsov-max force-pushed the zm/progress-and-error-reporting branch 2 times, most recently from 9d76d1e to b282d99 Compare February 16, 2022 19:19
@zhiltsov-max zhiltsov-max changed the title Add progress and error reporting API [WIP] Add progress and error reporting API Feb 18, 2022
@zhiltsov-max zhiltsov-max changed the title [WIP] Add progress and error reporting API Add progress and error reporting API Feb 18, 2022
@zhiltsov-max zhiltsov-max merged commit 6070d05 into develop Feb 18, 2022
zhiltsov-max (Contributor, Author) commented:

I'll merge it. There are definitely a lot of improvements that can be made, but they are not critical and can be done later.

Successfully merging this pull request may close these issues.

Provide a mode to ignore, fix, stop on or report errors in imported datasets