
Better Dataset implementation #187

Closed · wants to merge 6 commits into from

Conversation


@hhsecond hhsecond commented Mar 9, 2020

Motivation and Context

A continuation of #62; this should also solve #79.

Description

  • DataAccessor class that acts as a lightweight reader
  • Native Dataset
  • Torch Dataset
  • Tensorflow Dataset
  • Support for TF2.0 & TF1.0
  • Make batching possible for native
  • Return a namedtuple if found useful
  • Make shuffling possible (relevant for native and TensorFlow)
  • Parallel reads (relevant for native and TensorFlow)
  • Rebalancing
  • Dataset splits for test, train, and validation
  • Batching nested arrays

Types of changes

What types of changes does your code introduce? Put an x in all the boxes that apply:

  • Documentation update
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Is this PR ready for review, or a work in progress?

  • Ready for review
  • Work in progress

How Has This Been Tested?

Put an x in the boxes that apply:

  • Current tests cover modifications made
  • New tests have been added to the test suite
  • Modifications were made to existing tests to support these changes
  • Tests may be needed, but were not included when the PR was proposed
  • I don't know. Help!

Checklist:

  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING document.
  • I have signed (or will sign when prompted) the tensorwerk CLA.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

@hhsecond hhsecond changed the title Better Dataloader implementation Better Dataset implementation Mar 9, 2020

codecov bot commented Mar 9, 2020

Codecov Report

Merging #187 into master will decrease coverage by 0.33%.
The diff coverage is 86.86%.

@@            Coverage Diff             @@
##           master     #187      +/-   ##
==========================================
- Coverage   95.44%   95.12%   -0.33%     
==========================================
  Files          95       98       +3     
  Lines       15622    15584      -38     
  Branches     1487     1484       -3     
==========================================
- Hits        14910    14823      -87     
- Misses        469      510      +41     
- Partials      243      251       +8     
Impacted Files Coverage Δ
src/hangar/remote/server.py 73.76% <0.00%> (ø)
src/hangar/repository.py 95.48% <ø> (ø)
src/hangar/utils.py 97.39% <ø> (+2.33%) ⬆️
src/hangar/_version.py 96.43% <36.36%> (-3.57%) ⬇️
src/hangar/dataset/tensorflow_dset.py 65.71% <65.71%> (ø)
src/hangar/dataset/numpy_dset.py 67.57% <67.57%> (ø)
src/hangar/dataset/__init__.py 68.75% <68.75%> (ø)
src/hangar/dataset/common.py 69.39% <69.39%> (ø)
src/hangar/dataset/torch_dset.py 85.19% <85.19%> (ø)
src/hangar/columns/layout_flat.py 95.47% <90.00%> (+0.10%) ⬆️
... and 16 more

@hhsecond hhsecond requested a review from rlizzo March 15, 2020 15:42
@hhsecond hhsecond marked this pull request as ready for review March 19, 2020 06:34
def __getattr__(name):    # module-level __getattr__ (PEP 562)
    if name == 'make_tf_dataset':
        from .dataloaders import make_tf_dataset
        return make_tf_dataset

def __dir__():
hhsecond (Member Author):

@rlizzo For reference, this makes Hangar incompatible with Python 3.6, AFAIK.

hhsecond (Member Author):

We might have to go back to our old way of lazy loading with classes. What do you think?

rlizzo (Member):

Hey, so now that 0.5 is finally released, I'm giving this the time it deserves.

Unfortunately I had to totally revert these changes since I didn't see a way around the py3.6 issues without messing with the sys.path config (a big no-go for me), and it was even breaking 3.7/3.8 on Windows. We've got a fairly efficient solution to a rather difficult problem; it does the job and is easily understandable. This isn't an area of the codebase that needs to get complicated, IMO.

Style points aside, I do think we're generally OK with the status quo from a usability perspective? The make_foo_dataset methods are exposed explicitly via the __all__ variable. They're available seamlessly in a script, and are visible / introspectable when working in the REPL. Have you run into issues anywhere?

hhsecond (Member Author) commented Apr 5, 2020:

Module-level __getattr__ & __dir__ were only introduced in Python 3.7 (PEP 562), and hence:

from hangar import make_numpy_dataset  # will work because it is outside the guard
from hangar import make_torch_dataset  # won't work
from hangar import make_tensorflow_dataset  # won't work

This is my understanding. I haven't tried it out, but I am pretty sure this is the case for Python 3.6. You are talking about this same issue, right?

rlizzo (Member):

I actually wasn't aware of that PEP, but it seems that we've lucked out.

When I was debugging it, I actually saw the opposite:

from hangar import make_numpy_dataset  # won't work
from hangar import make_torch_dataset  # works
from hangar import make_tensorflow_dataset  # works 

Whether make_numpy_dataset was introduced in a guard didn't seem to matter (maybe since it would never raise an ImportError?). Nothing I did consistently worked across all the test platforms, so I dropped it.

In any case, I think we lucked out in our current config.

If an attribute is not found on a module object through the normal lookup (i.e. object.__getattribute__), then __getattr__ is searched in the module __dict__ before raising an AttributeError. If found, it is called with the attribute name and the result is returned. Looking up a name as a module global will bypass module __getattr__. This is intentional, otherwise calling __getattr__ for builtins will significantly harm performance.

We set the module-level global the first time that __getattr__ is called. It makes a lot more sense now why that was necessary....

If you run across another way to lazy-load, it might be worth looking into; I wasn't thrilled to read:

This PEP may break code that uses module level (global) names __getattr__ and __dir__. (But the language reference explicitly reserves all undocumented dunder names, and allows "breakage without warning"; see [3].)


What's the deal with lazy loading anything? If the user tries to import a file that imports torch and they don't have torch installed, let them get an ImportError. If all this is about trying to massage the namespace so that everything either appears or doesn't in a top level module... Don't do that.

It makes it harder to find the actual source code, and it confuses the issue, since if I'm missing another package, I'd like to get an ImportError that tells me that.

rlizzo (Member):

Short answer:

  • It's actually a major performance issue if we don't do some type of lazy-loading here.

Longer Answer

In order to have the make_torch_dataset and make_tf_dataset methods available as a top-level import:

from hangar import make_torch_dataset 
# compared to
from hangar.dataset.torch_dset import make_torch_dataset

we have to make them "importable" in the top-level __init__.py file. Since torch and tensorflow are optional dependencies, we have to at least guard against ImportError / ModuleNotFoundError being raised as hangar is imported in environments missing one or both of the packages.

That in itself is fairly trivial to do. If the module exists then great, it gets imported as normal; if not, you can either mock it out, or use some importlib/inspect magic to cut the execution path short....

The issue for us isn't actually with environments that don't have torch or tensorflow installed. It's for the ones that do.*

These are really huge packages with a ton of dependencies themselves... Importing them takes a noticeably long time, and having such massive dependencies sitting in RAM and polluting the package namespace accrues a non-trivial slowdown over time.

We use the "lazy-loaders" to stop the execution path of top-level hangar imports short of any file containing import torch or import tensorflow statements (unless explicitly requested). This happens regardless of whether torch or tensorflow are installed in the environment. The cost of keeping those packages around should only be paid if they're actually being used :)
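
For concreteness, here's a minimal sketch of the PEP 562 lazy-loading pattern described above. The module paths are assumptions pieced together from this thread, not the final layout:

# hangar/__init__.py -- hypothetical sketch; module paths are assumptions
__all__ = ['make_torch_dataset', 'make_tf_dataset']

def __getattr__(name):
    # Runs only when normal module attribute lookup fails (Python 3.7+),
    # so torch/tensorflow are imported on first access, not at `import hangar`.
    if name == 'make_torch_dataset':
        from .dataset.torch_dset import make_torch_dataset
        globals()[name] = make_torch_dataset  # cache: later lookups bypass __getattr__
        return make_torch_dataset
    if name == 'make_tf_dataset':
        from .dataset.tensorflow_dset import make_tf_dataset
        globals()[name] = make_tf_dataset
        return make_tf_dataset
    raise AttributeError(f"module {__name__!r} has no attribute {name!r}")

def __dir__():
    # keep the lazy names visible / introspectable in the REPL
    return sorted(list(globals()) + list(__all__))

Because each result is cached into the module globals on first access, __getattr__ pays its import cost only once per name, which matches the caching behavior noted in the PEP 562 quote above.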

hhsecond (Member Author):

@rlizzo I understand that you reverted the guard with __getattr__, but I found that the functions have moved from the top level to inside dataset:

from hangar.dataset import make_xxx_dataset

Was that intentional? I'd still keep them at the top level.

@rlizzo rlizzo left a comment


So I like where this is heading @hhsecond! Nice job!

As you'll undoubtedly notice, I've put a bit of time into this in the last two days... It's actually been relaxing to focus on something sane after the automation hell I've been through to get 0.5 built correctly on all the different platforms this past week...

A few points:

1. On the rename:

I really like this! dataset makes so much more sense than dataloaders. We need to push the "a dataset is just a view of pieces in various columns" message hard. I think this will help.

2. Performance:

I like how lean you've kept this; it's a modest piece of code doing a fairly large bit of work. Digging in, I got a bit worried about the time/memory cost though. Making this performant when dealing with large repos is going to be our main challenge going forward. We need to make sure this doesn't collapse if repos have many columns which might each contain hundreds of thousands / millions of samples.

  • I've already taken a first pass at the HangarDataset class. Generators + set comprehensions / in-place updates seem to be the way to go. I only got through the one module so far, but a few benchmarks show significant speed/memory savings.... While there are a few more optimizations to make in that module alone, I'm fairly certain that we'll need to be critical of almost every step in the call chain here.

  • I saw a note you left about trying to use something leaner than individual column instances to load the data. Great idea! We'll want a column around up front for a few convenience methods and utilities in the preprocessing / view creation, but once the samples have been split and batched, the column is absolutely not needed. I'll help out with this one, but it's as simple as building a few dictionaries mapping samples to backend locations, opening the backend file handles of interest with

def open_file_handles(backends, path, mode, schema, *, remote_operation=False):
    """Open backend accessor file handles for reading

    Parameters
    ----------
    backends : Set[str]
        if ``mode == 'r'`` then this should be the used backend format
        codes in the column. if ``mode == 'a'``, then this should be a
        list of the allowed backend format codes this schema can feasibly
        write to.
    path : Path
        path to the hangar repository on disk
    mode : str
        one of ['r', 'a'] indicating read or write mode to open backends in.
    schema : ColumnDefinitionTypes
        schema spec so required values can be filled in to backend openers.

    Returns
    -------
    AccessorMapType
        dict mapping backend format codes to initialized instances of each
        read-only backend.
    """

and then calling the same 2 lines of code we use in the column classes to read from disk.

spec = self._samples[key]
# self._be_fs = dict['backend_codes'] -> backend reader instances.
return self._be_fs[spec.backend].read_data(spec)
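
Wired together, a hypothetical minimal reader might look like the sketch below. The class name and constructor arguments are illustrative; only the two read lines come from the actual column classes:

class LightweightSampleReader:
    """Hypothetical sketch: read sample data straight from the backend
    stores, skipping the full column object entirely."""

    def __init__(self, samples, be_fs):
        self._samples = samples  # dict: sample key -> backend location spec
        self._be_fs = be_fs      # dict: backend format code -> reader instance

    def __getitem__(self, key):
        # the same two lines the column classes use to read from disk
        spec = self._samples[key]
        return self._be_fs[spec.backend].read_data(spec)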

3. API:

There are two issues I see here:

  1. Collating data in different columns which have different types.
  2. Specifying views across columns which have different layouts.

It took a minute to see, but these are the exact same problems we've solved in the rest of Hangar's API.

1. Column/Schema Types

We've already solved the "incompatible data type" problem with a pluggable system of column schemas... Since each schema knows exactly what it is and how to process data passing through the column API, we can rely on it to collate/stack batches of data correctly.

  • for the ndarray columns, we would of course just stack the arrays along a new axis to create the batch...
  • we should do the same with the str typed column schemas, except that instead of calling np.stack it should be a list.append() operation (or something else which makes sense for how Python strings are grouped together in a container; it doesn't really matter...)
  • The point is that we can't stick an np.ndarray AND a str typed piece of data together in any way that makes sense. What we can do is put each piece of data together appropriately for the type of column it is stored in.

As for how it's returned to the user, we handle that the same way we handle everything else: order in == order out. If you specify

dset = make_numpy_dataset(
    columns=['train_images', 'train_labels'], 
    keys=('foo', 'bar', 'baz', 'foo', 'bar2'), 
    batch_size=2, 
    drop_last=False)

for data_batch in dset:
    print(data_batch)

It should output something like the following in a regular tuple, where len(data_batch) == len(columns) and the order of output column results is exactly the order in which they appeared in the input columns argument to make_foo_dataset().

The order of samples returned in each batch - and its index in the batch - should exactly match the ordering of their position in the input keys sequence unless shuffle=True is specified. We don't deduplicate requests for the same keys; we don't mutate order or structure in any way unless explicitly asked to. For example:

([[img_foo_data], [img_bar_data]],  [label_foo_data, label_bar_data])
([[img_baz_data], [img_foo_data]],  [label_baz_data, label_foo_data])
([img_bar2_data], [label_bar2_data])

If we preserve order unless explicitly instructed otherwise we can eliminate a lot of hardship right up front: simple to use and understand, as well as highly customizable based on the end user application.

In regards to the type of container the output is stored in:

namedtuple is a no-go since dynamic definition is a disaster. The choice is between tuples and dicts. I'm strongly in favor of tuples (so long as we preserve ordering like I outlined above). I don't see a clear need to return a mapping in most cases, and creating many thousands of dicts is actually a non-trivial expense if we don't need them. I think returning tuples as a default and including an option to return a dict instead would provide enough flexibility for most people?
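
To make the collation and ordering rules above concrete, here's a rough sketch; the collate functions, the COLLATORS registry, and iter_batches are hypothetical names, not Hangar API:

import numpy as np

# hypothetical per-schema collate functions, keyed by column schema type
def collate_ndarray(samples):
    return np.stack(samples, axis=0)  # batch along a new leading axis

def collate_str(samples):
    return list(samples)              # a plain list is fine for strings

COLLATORS = {'ndarray': collate_ndarray, 'str': collate_str}

def iter_batches(columns, schema_types, keys, batch_size, drop_last=False):
    """Yield one tuple per batch, one entry per column: order in == order out."""
    for i in range(0, len(keys), batch_size):
        batch_keys = keys[i:i + batch_size]  # no dedup, no reordering
        if drop_last and len(batch_keys) < batch_size:
            return
        yield tuple(
            COLLATORS[stype]([col[k] for k in batch_keys])
            for col, stype in zip(columns, schema_types)
        )

With batch_size=2 and drop_last=False this yields exactly the three batches shown in the example output above: two full batches followed by a final batch of one.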

2. Specifying Views / Column Layouts

The column layouts were created with the express purpose of allowing for access/usage patterns which are wholly incompatible with one another. There may be similarities and best practices, but at the end of the day we can't force a single-level store to be used in the same way a nested dictionary would be.

I'm going to propose we use a slightly modified version of the syntax used to retrieve a view of some sample / columns from the checkout object.

>>> dset[('nested_col', 'sample_1', '0'), ('foo', '0')]
[np.array([1, 0]), np.array([0])]

>>> dset[('nested_col', 'sample_1', ...), ('foo', '0')]
[{'subsample_0': np.array([1, 0]), 'subsample_1': np.array([1, 1])},
 np.array([0])]

>>> dset[('foo', '0'), ('nested_col', 'foo', slice(0, 3))]
[np.array([0]),
 {'subsample_0': np.array([1, 0]),
  'subsample_1': np.array([1, 1]),
  'subsample_2': np.array([2, 2])}]

But instead of having to specify the column every time, the sample keys for one view would be nested in a list and matched to their column by index. Something like this?

>>> make_numpy_dataset(columns=('nested_col', 'foo'), keys=[(('sample_1', '0'), '0'),])
 
>>> make_numpy_dataset(
...     columns=('nested_col', 'foo'),
...     keys=[(('sample_1', ...), '1'), (('sample_2', ...), '2')]
... )

# it's easy enough to generate keys this way as well.
>>> generated_keys = []
>>> for sample in train_col.keys():
...     samp_key = ((sample, ...), sample, ('train_category', sample))
...     generated_keys.append(samp_key)
>>>
>>> make_numpy_dataset(('nested_col', 'foo', 'nested_categories'), keys=generated_keys)

Still a bit of a WIP, but it'd be nice to be consistent across the platform and this seems like a reasonable solution.

At any rate, I think we could knock this out in a short bit and push it to 0.5.1? Let me know your thoughts!

@rlizzo rlizzo mentioned this pull request Apr 8, 2020

CLAassistant commented Apr 10, 2020

CLA assistant check
All committers have signed the CLA.

rlizzo and others added 6 commits July 22, 2020 07:44
* New Hangar Quick Start Tutorial - download and organize Imagenette data

* Add Column initialization, Adding data and Committing changes section

* Fix validation data loading, add more comments

* Add Alessia as author

* Change name of the checkout object, add a note for larger dataset reading

* Update Tutorial-QuickStart.ipynb

* Fix paths and commit id

* Change title, minor fixes

* Fix typo

* Minor fixes

Co-authored-by: Rick Izzo <rlizzo@users.noreply.github.com>
* updated changelog

* Bump version: 0.5.1 → 0.5.2
Experimental decorator, documentation, type checking

torch dataset can return OrderedDict or tuple

guarding import exceptions for machine learning libraries that may not exist on user system

refactorings and a few performance considerations to make working with batching keys easier.
@hhsecond hhsecond mentioned this pull request Jul 22, 2020
hhsecond (Member Author):

I found that the commit history is somehow screwed up and not fixable without careful analysis, and that's probably not worth it. I am closing this PR in favor of #206.

@hhsecond hhsecond closed this Jul 22, 2020