
Better Dataset implementation #187

Closed · wants to merge 6 commits into from

Conversation


@hhsecond hhsecond commented Mar 9, 2020

Motivation and Context

A continuation of #62; this should also solve #79.

Description

  • DataAccessor class that acts as a lightweight reader
  • Native Dataset
  • Torch Dataset
  • Tensorflow Dataset
  • Support for TF2.0 & TF1.0
  • Make batching possible for native
  • Return a namedtuple if found useful
  • Make shuffling possible (relevant for native and TensorFlow)
  • Parallel reads (relevant for native and TensorFlow)
  • Rebalancing
  • Dataset splits for test, train, and validation
  • Batching nested arrays

Types of changes

What types of changes does your code introduce? Put an x in all the boxes that apply:

  • Documentation update
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Is this PR ready for review, or a work in progress?

  • Ready for review
  • Work in progress

How Has This Been Tested?

Put an x in the boxes that apply:

  • Current tests cover modifications made
  • New tests have been added to the test suite
  • Modifications were made to existing tests to support these changes
  • Tests may be needed, but were not included when the PR was proposed
  • I don't know. Help!

Checklist:

  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING document.
  • I have signed (or will sign when prompted) the tensorwerk CLA.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

@hhsecond hhsecond changed the title Better Dataloader implementation Better Dataset implementation Mar 9, 2020

codecov bot commented Mar 9, 2020

Codecov Report

Merging #187 into master will decrease coverage by 0.33%.
The diff coverage is 86.86%.

@@            Coverage Diff             @@
##           master     #187      +/-   ##
==========================================
- Coverage   95.44%   95.12%   -0.33%     
==========================================
  Files          95       98       +3     
  Lines       15622    15584      -38     
  Branches     1487     1484       -3     
==========================================
- Hits        14910    14823      -87     
- Misses        469      510      +41     
- Partials      243      251       +8     
Impacted Files Coverage Δ
src/hangar/remote/server.py 73.76% <0.00%> (ø)
src/hangar/repository.py 95.48% <ø> (ø)
src/hangar/utils.py 97.39% <ø> (+2.33%) ⬆️
src/hangar/_version.py 96.43% <36.36%> (-3.57%) ⬇️
src/hangar/dataset/tensorflow_dset.py 65.71% <65.71%> (ø)
src/hangar/dataset/numpy_dset.py 67.57% <67.57%> (ø)
src/hangar/dataset/__init__.py 68.75% <68.75%> (ø)
src/hangar/dataset/common.py 69.39% <69.39%> (ø)
src/hangar/dataset/torch_dset.py 85.19% <85.19%> (ø)
src/hangar/columns/layout_flat.py 95.47% <90.00%> (+0.10%) ⬆️
... and 16 more

@hhsecond hhsecond requested a review from rlizzo March 15, 2020 15:42
@hhsecond hhsecond marked this pull request as ready for review March 19, 2020 06:34
def __getattr__(name):    # module-level __getattr__ (PEP 562)
    if name == 'make_tf_dataset':
        from .dataloaders import make_tf_dataset
        return make_tf_dataset

def __dir__():
hhsecond (Member Author):

@rlizzo For reference, this makes Hangar incompatible with Python 3.6, AFAIK.

hhsecond (Member Author):

We might have to go back to our old way of lazy loading with classes. What do you think?

rlizzo (Member):

Hey, so now that 0.5 is finally released, I'm giving this the time it deserves.

Unfortunately I had to totally revert these changes since I didn't see a way around the py3.6 issues without messing with the sys.path config (a big no-go for me), and it was even breaking 3.7/3.8 on Windows. We've got a fairly efficient solution to a rather difficult problem; it does the job and is easily understandable. This isn't an area of the codebase that needs to get complicated, IMO.

Style points aside, I do think we're generally OK with the status quo from a usability perspective? The make_foo_dataset methods are exposed explicitly via the __all__ variable. They're available seamlessly in a script, and are visible / introspectable when working in the REPL. Have you run into issues anywhere?

hhsecond (Member Author) commented Apr 5, 2020:

Module-level __getattr__ & __dir__ were only introduced in Python 3.7 (PEP 562), and hence:

from hangar import make_numpy_dataset  # will work because it is outside the guard
from hangar import make_torch_dataset  # won't work
from hangar import make_tensorflow_dataset  # won't work

This is my understanding. I haven't tried it out, but I am pretty sure this is the case for Python 3.6. You are talking about this same issue, right?

rlizzo (Member):

I actually wasn't aware of that PEP, but it seems that we've lucked out.

When I was debugging it, I actually saw the opposite:

from hangar import make_numpy_dataset  # won't work
from hangar import make_torch_dataset  # works
from hangar import make_tensorflow_dataset  # works 

Whether make_numpy_dataset was introduced in a guard didn't seem to matter (maybe since it would never raise an ImportError?). Nothing I did consistently worked across all the test platforms, so I dropped it.

In any case, I think we lucked out in our current config.

If an attribute is not found on a module object through the normal lookup (i.e. object.__getattribute__), then __getattr__ is searched in the module __dict__ before raising an AttributeError. If found, it is called with the attribute name and the result is returned. Looking up a name as a module global will bypass module __getattr__. This is intentional, otherwise calling __getattr__ for builtins will significantly harm performance.

We set the module-level global the first time that __getattr__ is called. It makes a lot more sense now why that was necessary....

If you run across another way to lazy-load, it might be worth looking into; I wasn't thrilled to read:

This PEP may break code that uses module level (global) names __getattr__ and __dir__. (But the language reference explicitly reserves all undocumented dunder names, and allows "breakage without warning"; see [3].)


What's the deal with lazy loading anything? If the user tries to import a file that imports torch and they don't have torch installed, let them get an ImportError. If all this is about trying to massage the namespace so that everything either appears or doesn't in a top level module... Don't do that.

It makes it harder to find the actual source code, and it confuses the issue, since if I'm missing another package, I'd like to get an ImportError that tells me that.

rlizzo (Member):

Short answer:

  • It's actually a major performance issue if we don't do some type of lazy-loading here.

Longer Answer

In order to have the make_torch_dataset and make_tf_dataset methods available as a top-level import:

from hangar import make_torch_dataset 
# compared to
from hangar.dataset.torch_dset import make_torch_dataset

we have to make them "importable" in the top-level __init__.py file. Since torch and tensorflow are optional dependencies, we have to at least guard against ImportError / ModuleNotFoundError being raised as hangar is imported in environments missing one or both of the packages.

That in itself is fairly trivial to do. If the module exists then great, it gets imported as normal; if not, you can either mock it out, or use some importlib/inspect magic to cut the execution path short....

The issue for us isn't actually with environments that don't have torch or tensorflow installed. It's for the ones that do.*

These are really huge packages with a ton of dependencies themselves... Importing them takes a noticeably long time, and having such massive dependencies sitting in RAM and polluting the package namespace accrues a non-trivial slowdown over time.

We use the "lazy-loaders" to stop the execution path of top-level hangar imports short of any file containing import torch or import tensorflow statements (unless explicitly requested). This happens regardless of whether torch or tensorflow are installed in the environment. The cost of keeping those packages around should only be paid if they're actually being used :)
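
For concreteness, here's a minimal sketch of the PEP 562 lazy-loading pattern described above. The module paths are assumptions pieced together from this thread, not the final layout:

# hangar/__init__.py -- hypothetical sketch; module paths are assumptions
__all__ = ['make_torch_dataset', 'make_tf_dataset']

def __getattr__(name):
    # Runs only when normal module attribute lookup fails (Python 3.7+),
    # so torch/tensorflow are imported on first access, not at `import hangar`.
    if name == 'make_torch_dataset':
        from .dataset.torch_dset import make_torch_dataset
        globals()[name] = make_torch_dataset  # cache: later lookups bypass __getattr__
        return make_torch_dataset
    if name == 'make_tf_dataset':
        from .dataset.tensorflow_dset import make_tf_dataset
        globals()[name] = make_tf_dataset
        return make_tf_dataset
    raise AttributeError(f"module {__name__!r} has no attribute {name!r}")

def __dir__():
    # keep the lazy names visible / introspectable in the REPL
    return sorted(list(globals()) + list(__all__))

Because each result is cached into the module globals on first access, __getattr__ pays its import cost only once per name, which matches the caching behavior noted in the PEP 562 quote above.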

hhsecond (Member Author):

@rlizzo I understand that you reverted the guard with __getattr__, but I found that the functions have moved from the top level to inside dataset:

from hangar.dataset import make_xxx_dataset

Was that intentional? I'd still keep them at the top level.

@rlizzo rlizzo left a comment


So I like where this is heading @hhsecond! Nice job!

As you'll undoubtedly notice, I've put a bit of time into this in the last two days... It's actually been relaxing to focus on something sane after the automation hell I've been through to get 0.5 built correctly on all the different platforms this past week...

A few points:

1. On the rename:

I really like this! dataset makes so much more sense than dataloaders. We need to push the "a dataset is just a view of pieces in various columns" message hard. I think this will help.

2. Performance:

I like how lean you've kept this; it's a modest piece of code doing a fairly large bit of work. Digging in, I got a bit worried about the time/memory cost though. Making this performant when dealing with large repos is going to be our main challenge going forward. We need to make sure this doesn't collapse if repos have many columns which might each contain hundreds of thousands / millions of samples.

  • I've already taken a first pass at the HangarDataset class. Generators + set comprehensions / in-place updates seem to be the way to go. I only got through the one module so far, but a few benchmarks show significant speed/memory savings.... While there are a few more optimizations to make in that module alone, I'm fairly certain that we'll need to be critical of almost every step in the call chain here.

  • I saw a note you left about trying to use something leaner than individual column instances to load the data. Great idea! We'll want a column around up front for a few convenience methods and utilities in the preprocessing / view creation, but once the samples have been split and batched, the column is absolutely not needed. I'll help out with this one, but it's as simple as building a few dictionaries mapping samples to backend locations, opening the backend file handles of interest with

def open_file_handles(backends, path, mode, schema, *, remote_operation=False):
    """Open backend accessor file handles for reading

    Parameters
    ----------
    backends : Set[str]
        if ``mode == 'r'`` then this should be the used backend format
        codes in the column. if ``mode == 'a'``, then this should be a
        list of the allowed backend format codes this schema can feasibly
        write to.
    path : Path
        path to the hangar repository on disk
    mode : str
        one of ['r', 'a'] indicating read or write mode to open backends in.
    schema : ColumnDefinitionTypes
        schema spec so required values can be filled in to backend openers.

    Returns
    -------
    AccessorMapType
        dict mapping backend format codes to initialized instances of each
        read-only backend.
    """

and then calling the same 2 lines of code we use in the column classes to read from disk.

spec = self._samples[key]
# self._be_fs = dict['backend_codes'] -> backend reader instances.
return self._be_fs[spec.backend].read_data(spec)
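
Wired together, a hypothetical minimal reader might look like the sketch below. The class name and constructor arguments are illustrative; only the two read lines come from the actual column classes:

class LightweightSampleReader:
    """Hypothetical sketch: read sample data straight from the backend
    stores, skipping the full column object entirely."""

    def __init__(self, samples, be_fs):
        self._samples = samples  # dict: sample key -> backend location spec
        self._be_fs = be_fs      # dict: backend format code -> reader instance

    def __getitem__(self, key):
        # the same two lines the column classes use to read from disk
        spec = self._samples[key]
        return self._be_fs[spec.backend].read_data(spec)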

3. API:

There are two issues I see here:

  1. Collating data in different columns which have different types.
  2. Specifying views across columns which have different layouts.

It took a minute to see, but these are the exact same problems we've solved in the rest of Hangar's API.

1. Column/Schema Types

We've already solved the "incompatible data type" problem with a pluggable system of column schemas... Since each schema knows exactly what it is and how to process data passing through the column API, we can rely on it to collate/stack batches of data correctly.

  • for the ndarray columns, we would of course just stack the arrays along a new axis to create the batch...
  • we should do the same with the str typed column schemas, except that instead of calling np.stack it should be a list.append() operation (or something else which makes sense for how Python strings are grouped together in a container; it doesn't really matter...)
  • The point is that we can't stick an np.ndarray AND a str typed piece of data together in any way that makes sense. What we can do is put each piece of data together appropriately for the type of column it is stored in.

As for how it's returned to the user, we handle that the same way we handle everything else: order in == order out. If you specify

dset = make_numpy_dataset(
    columns=['train_images', 'train_labels'], 
    keys=('foo', 'bar', 'baz', 'foo', 'bar2'), 
    batch_size=2, 
    drop_last=False)

for data_batch in dset:
    print(data_batch)

It should output something like the following in a regular tuple, where len(data_batch) == len(columns) and the order of output column results is exactly the order in which they appeared in the input columns argument to make_foo_dataset().

The order of samples returned in each batch - and its index in the batch - should exactly match the ordering of their position in the input keys sequence unless shuffle=True is specified. We don't deduplicate requests for the same keys; we don't mutate order or structure in any way unless explicitly asked to. For example:

([[img_foo_data], [img_bar_data]],  [label_foo_data, label_bar_data])
([[img_baz_data], [img_foo_data]],  [label_baz_data, label_foo_data])
([img_bar2_data], [label_bar2_data])

If we preserve order unless explicitly instructed otherwise we can eliminate a lot of hardship right up front: simple to use and understand, as well as highly customizable based on the end user application.

In regards to the type of container the output is stored in:

namedtuple is a no-go since dynamic definition is a disaster. The choice is between tuples and dicts. I'm strongly in favor of tuples (so long as we preserve ordering like I outlined above). I don't see a clear need to return a mapping in most cases, and creating many thousands of dicts is actually a non-trivial expense if we don't need them. I think returning tuples as a default and including an option to return a dict instead would provide enough flexibility for most people?
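
To make the collation and ordering rules above concrete, here's a rough sketch; the collate functions, the COLLATORS registry, and iter_batches are hypothetical names, not Hangar API:

import numpy as np

# hypothetical per-schema collate functions, keyed by column schema type
def collate_ndarray(samples):
    return np.stack(samples, axis=0)  # batch along a new leading axis

def collate_str(samples):
    return list(samples)              # a plain list is fine for strings

COLLATORS = {'ndarray': collate_ndarray, 'str': collate_str}

def iter_batches(columns, schema_types, keys, batch_size, drop_last=False):
    """Yield one tuple per batch, one entry per column: order in == order out."""
    for i in range(0, len(keys), batch_size):
        batch_keys = keys[i:i + batch_size]  # no dedup, no reordering
        if drop_last and len(batch_keys) < batch_size:
            return
        yield tuple(
            COLLATORS[stype]([col[k] for k in batch_keys])
            for col, stype in zip(columns, schema_types)
        )

With batch_size=2 and drop_last=False this yields exactly the three batches shown in the example output above: two full batches followed by a final batch of one.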

2. Specifying Views / Column Layouts

The column layouts were created with the express purpose of allowing for access/usage patterns which are wholly incompatible with one another. There may be similarities and best practices, but at the end of the day we can't force a single-level store to be used in the same way a nested dictionary would be.

I'm going to propose we use a slightly modified version of the syntax used to retrieve a view of some sample / columns from the checkout object.

>>> dset[('nested_col', 'sample_1', '0'), ('foo', '0')]
[np.array([1, 0]), np.array([0])]

>>> dset[('nested_col', 'sample_1', ...), ('foo', '0')]
[{'subsample_0': np.array([1, 0]), 'subsample_1': np.array([1, 1])},
 np.array([0])]

>>> dset[('foo', '0'), ('nested_col', 'foo', slice(0, 3))]
[np.array([0]),
 {'subsample_0': np.array([1, 0]),
  'subsample_1': np.array([1, 1]),
  'subsample_2': np.array([2, 2])}]

But instead of having to specify the column every time, the sample keys for one view would be nested in a list and matched to their column by index. Something like this?

>>> make_numpy_dataset(columns=('nested_col', 'foo'), keys=[(('sample_1', '0'), '0'),])
 
>>> make_numpy_dataset(
...     columns=('nested_col', 'foo'),
...     keys=[(('sample_1', ...), '1'), (('sample_2', ...), '2')]
... )

# it's easy enough to generate keys this way as well.
>>> generated_keys = []
>>> for sample in train_col.keys():
...     samp_key = ((sample, ...), sample, ('train_category', sample))
...     generated_keys.append(samp_key)
>>>
>>> make_numpy_dataset(('nested_col', 'foo', 'nested_categories'), keys=generated_keys)

Still a bit of a WIP, but it'd be nice to be consistent across the platform and this seems like a reasonable solution.

At any rate, I think we could knock this out in a short bit and push it to 0.5.1? Let me know your thoughts!

@rlizzo rlizzo mentioned this pull request Apr 8, 2020

CLAassistant commented Apr 10, 2020

CLA assistant check
All committers have signed the CLA.

rlizzo and others added 6 commits July 22, 2020 07:44
* New Hangar Quick Start Tutorial - download and organize Imagenette data

* Add Column initialization, Adding data and Committing changes section

* Fix validation data loading, add more comments

* Add Alessia as author

* Change name of the checkout object, add a note for larger dataset reading

* Update Tutorial-QuickStart.ipynb

* Fix paths and commit id

* Change title, minor fixes

* Fix typo

* Minor fixes

Co-authored-by: Rick Izzo <rlizzo@users.noreply.github.com>
* updated changelog

* Bump version: 0.5.1 → 0.5.2
Experimental decorator, documentation, type checking

torch dataset can return OrderedDict or tuple

guarding import exceptions for machine learning libraries that may not exist on user system

refactorings and a few performance considerations to make working with batching keys easier.
@hhsecond hhsecond mentioned this pull request Jul 22, 2020
hhsecond (Member Author):

I found that the commit history is somehow screwed up and not fixable without careful analysis, and that's probably not worth it. I am closing this PR in favor of #206.

@hhsecond hhsecond closed this Jul 22, 2020