Better Dataset implementation #187
Conversation
Codecov Report
@@ Coverage Diff @@
## master #187 +/- ##
==========================================
- Coverage 95.44% 95.12% -0.33%
==========================================
Files 95 98 +3
Lines 15622 15584 -38
Branches 1487 1484 -3
==========================================
- Hits 14910 14823 -87
- Misses 469 510 +41
- Partials 243 251 +8
src/hangar/__init__.py (outdated)
    if name == 'make_tf_dataset':
        from .dataloaders import make_tf_dataset
        return make_tf_dataset
def __dir__():
@rlizzo For reference, this makes hangar incompatible with Python 3.6, AFAIK.
We might have to go back to our old way of lazy loading with classes. What do you think?
Hey so now that 0.5 is finally released I'm giving this the time it deserves.
Unfortunately I had to totally revert these changes, since I didn't see a way around the py3.6 issues without messing with the sys.path config (a big no-go for me), and it was even breaking 3.7/3.8 on Windows. We've got a fairly efficient solution to a rather difficult problem; it does the job and is easily understandable, and this isn't an area of the codebase that needs to get complicated IMO.
Style points aside, I do think we're generally OK with the status quo from a usability perspective? The make_foo_dataset methods are exposed explicitly via the __all__ variable. They're available seamlessly in a script, and are visible / introspectable when working in the REPL. Have you run into issues anywhere?
__getattr__ and __dir__ on a module were introduced in Python 3.7. Hence:
from hangar import make_numpy_dataset       # will work because it is outside the guard
from hangar import make_torch_dataset       # won't work
from hangar import make_tensorflow_dataset  # won't work
This is my understanding. I haven't tried it out, but I am pretty sure this is the case for Python 3.6. You are talking about this same issue, right?
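The module-level __getattr__ behavior being discussed (PEP 562, Python 3.7+) can be sketched without touching Hangar at all. This is a minimal illustration, not Hangar's actual code: the module name "demo_lazy" and attribute "answer" are made up for the demo, and a heavy import would stand in for the computed value.

```python
import sys
import types

# Build a throwaway module object and give it a PEP 562 module-level
# __getattr__ (looked up in the module __dict__ on failed attribute access).
mod = types.ModuleType("demo_lazy")

def _module_getattr(name):
    if name == "answer":
        value = 42                     # an expensive import would go here
        setattr(mod, "answer", value)  # cache as a module global so
        return value                   # __getattr__ only runs once
    raise AttributeError(f"module 'demo_lazy' has no attribute {name!r}")

mod.__getattr__ = _module_getattr
sys.modules["demo_lazy"] = mod

import demo_lazy
print(demo_lazy.answer)  # first access falls through to __getattr__
```

On Python 3.6 the `__getattr__` in the module `__dict__` is simply never consulted, which is exactly the incompatibility raised above.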
I actually wasn't aware of that PEP, but it seems that we've lucked out.
When I was debugging it I actually saw the opposite:
from hangar import make_numpy_dataset       # won't work
from hangar import make_torch_dataset       # works
from hangar import make_tensorflow_dataset  # works
Whether make_numpy_dataset was introduced in a guard didn't seem to matter (maybe since it would never raise an ImportError?). Nothing I did worked consistently across all the test platforms, so I dropped it.
In any case, I think we lucked out in our current config.
If an attribute is not found on a module object through the normal lookup (i.e. object.__getattribute__), then __getattr__ is searched in the module __dict__ before raising an AttributeError. If found, it is called with the attribute name and the result is returned. Looking up a name as a module global will bypass module __getattr__. This is intentional, otherwise calling __getattr__ for builtins will significantly harm performance.
We set the module-level global the first time that __getattr__ is called. It makes a lot more sense now why that was necessary...
If you run across another way to lazy-load, it might be worth looking into; I wasn't thrilled to read:
This PEP may break code that uses module level (global) names __getattr__ and __dir__. (But the language reference explicitly reserves all undocumented dunder names, and allows "breakage without warning"; see [3].)
What's the deal with lazy loading anything? If the user tries to import a file that imports torch and they don't have torch installed, let them get an ImportError. If all this is about trying to massage the namespace so that everything either appears or doesn't in a top-level module... don't do that.
It makes it harder to find the actual source code, and it confuses the issue, since if I'm missing another package, I'd like to get an ImportError that tells me that.
Short answer:
- It's actually a major performance issue if we don't do some type of lazy-loading here.
Longer Answer
In order to have the make_torch_dataset and make_tf_dataset methods available as a top-level import:
from hangar import make_torch_dataset
# compared to
from hangar.datasets.torch_loader import make_torch_dataset
we have to make them "importable" in the top-level __init__.py file. Since torch and tensorflow are optional dependencies, we have to at least guard against environments missing one or both of the packages raising ImportError / ModuleNotFoundError as hangar is imported.
That in itself is fairly trivial to do. If the module exists, then great, it gets imported as normal; if not, you can either mock it out, or use some importlib / inspect magic to cut the execution path short...
The issue for us isn't actually with environments that don't have torch or tensorflow installed. It's the ones that do.
These are really huge packages with a ton of dependencies themselves... Importing them takes a noticeably long time, and having such massive dependencies sitting in RAM and polluting the package namespace accrues a non-trivial slowdown over time.
We use the "lazy-loaders" to stop the execution path of top-level hangar imports short of any file containing import torch or import tensorflow statements (unless explicitly requested). This happens regardless of whether torch or tensorflow are installed in the environment. The cost of keeping those packages around should only be paid if they're actually being used :)
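The general shape of such a lazy loader can be sketched as below. This is an illustration of the idea, not Hangar's implementation: the lazy_loader helper is hypothetical, and "json" stands in for a heavy optional dependency like torch so the demo stays runnable anywhere.

```python
import importlib
import importlib.util

def lazy_loader(module_name, attr):
    """Return a callable that imports `module_name` only on first call."""
    def load(*args, **kwargs):
        # Check availability without importing anything heavy.
        if importlib.util.find_spec(module_name) is None:
            raise ImportError(
                f"optional dependency {module_name!r} is not installed")
        module = importlib.import_module(module_name)  # pay the cost only now
        return getattr(module, attr)(*args, **kwargs)
    return load

# Defining the wrapper imports nothing; the target is loaded on first use.
make_demo = lazy_loader("json", "dumps")
print(make_demo([1, 2]))  # '[1, 2]'
```

Calling a wrapper whose dependency is genuinely missing raises the same ImportError the user would expect, just deferred from import time to call time.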
@rlizzo I understand that you reverted the guard with __getattr__, but I found that the functions moved from the top level to inside dataset:
from hangar.dataset import make_xxx_dataset
Was that intentional? I'd still keep them at the top level.
So I like where this is heading @hhsecond! Nice job!
As you'll undoubtedly notice, I've put a bit of time into this in the last two days... It's actually been relaxing to focus on something sane after the automation hell I've been through to get 0.5 built correctly on all the different platforms this past week...
A few points:
1. On the rename:
I really like this! dataset makes so much more sense than dataloaders. We need to push the "a dataset is just a view of pieces in various columns" message hard. I think this will help.
2. Performance:
I like how lean you've kept this; it's a modest piece of code doing a fairly large bit of work. Digging in, I got a bit worried about the time/memory cost though. Making this performant when dealing with large repos is going to be our main challenge going forward. We need to make sure this doesn't collapse if repos have many columns which each might contain hundreds of thousands / millions of samples.
- I've already taken a first pass at the HangarDataset class. Generators + set comprehensions / in-place updates seem to be the way to go. I only got through the one module so far, but a few benchmarks show significant speed/memory savings... While there are a few more optimizations to make in that module alone, I'm fairly certain that we'll need to be critical of nearly every step in the call chain here.
- I saw a note you left about trying to use something leaner than individual column instances to load the data. Great idea! We'll want a column around up front for a few convenience methods and utilities in the preprocessing / view creation, but once the samples have been split and batched, the column is absolutely not needed. I'll help out with this one, but it's as simple as building a few dictionaries mapping samples to backend locations, opening the backend file handles of interest with
hangar-py/src/hangar/columns/common.py
Lines 99 to 121 in 01c94bd
def open_file_handles(backends, path, mode, schema, *, remote_operation=False):
    """Open backend accessor file handles for reading

    Parameters
    ----------
    backends : Set[str]
        if ``mode == 'r'`` then this should be the used backend format
        codes in the column. if ``mode == 'a'``, then this should be a
        list of the allowed backend format codes this schema can feasably
        write to.
    path : Path
        path to the hangar repository on disk
    mode : str
        one of ['r', 'a'] indicating read or write mode to open backends in.
    schema : ColumnDefinitionTypes
        schema spec so required values can be filled in to backend openers.

    Returns
    -------
    AccessorMapType
        dict mapping backend format codes to initialized instances of each
        read-only backend.
    """
and then calling the same 2 lines of code we use in the column
classes to read from disk.
spec = self._samples[key]
# self._be_fs = dict['backend_codes'] -> backend reader instances.
return self._be_fs[spec.backend].read_data(spec)
3. API:
There are two issues I see here:
- Collating data in different columns which have different types.
- Specifying views across columns which have different layouts.
It took a minute to see, but these are the exact same problems we've solved in the rest of Hangar's API..
1. Column/Schema Types
We've already solved the "incompatible data type" problem with a pluggable system of column schemas... Since each schema knows exactly what it is and how to process data passing through the column API, we can rely on it to collate/stack batches of data correctly.
- for the ndarray columns, we would of course just stack arrays along a new axis to create the batch...
- we should do the same with the str typed column schemas; instead of calling np.stack, it should be a list.append() operation (or something else which makes sense for how python strings are grouped together in a container, it doesn't really matter...)
- The point is that we can't stick an np.ndarray AND str typed piece of data together in any way that makes sense. What we can do is put each piece of data together appropriately for the type of column it is stored in.
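A rough sketch of that per-type collation idea, assuming the simplest possible dispatch (a real version would ask the column schema for its type rather than inspecting samples; collate_batch is a hypothetical name):

```python
import numpy as np

def collate_batch(samples):
    """Group a batch of same-column samples appropriately for their type."""
    if all(isinstance(s, np.ndarray) for s in samples):
        return np.stack(samples)  # ndarray columns: new leading batch axis
    return list(samples)          # str-typed columns: just collect in a list

arr_batch = collate_batch([np.array([1, 0]), np.array([0, 1])])
print(arr_batch.shape)                # (2, 2)
print(collate_batch(["cat", "dog"]))  # ['cat', 'dog']
```

Each column collates its own samples, and only then are the per-column results zipped together into the output batch.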
As for how it's returned to the user, we handle that the same way we handle everything else: order in == order out. If you specify
dset = make_numpy_dataset(
    columns=['train_images', 'train_labels'],
    keys=('foo', 'bar', 'baz', 'foo', 'bar2'),
    batch_size=2,
    drop_last=False)
for data_batch in dset:
    print(data_batch)
it should output something like the following in a regular tuple, where len(data_batch) == len(columns) and the order of output column results is exactly the order in which they appeared in the input columns argument to make_foo_dataset().
The order of samples returned in each batch - and each sample's index in the batch - should exactly match their position in the input keys sequence unless shuffle=True is specified. We don't deduplicate requests for the same keys, and we don't mutate order or structure in any way unless explicitly asked to. For example:
([[img_foo_data], [img_bar_data]], [label_foo_data, label_bar_data])
([[img_baz_data], [img_foo_data]], [label_baz_data, label_foo_data])
([img_bar2_data], [label_bar2_data])
If we preserve order unless explicitly instructed otherwise we can eliminate a lot of hardship right up front: simple to use and understand, as well as highly customizable based on the end user application.
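The key-batching rule described above can be sketched in a few lines. This does not reimplement make_numpy_dataset; it only shows the ordering contract: keys consumed in the exact order given, duplicates preserved, and the final short batch kept when drop_last=False.

```python
def batched(keys, batch_size, drop_last=False):
    """Split keys into ordered batches, preserving duplicates."""
    batches = [list(keys[i:i + batch_size])
               for i in range(0, len(keys), batch_size)]
    if drop_last and batches and len(batches[-1]) < batch_size:
        batches.pop()  # discard the trailing partial batch if requested
    return batches

keys = ('foo', 'bar', 'baz', 'foo', 'bar2')
print(batched(keys, batch_size=2))
# [['foo', 'bar'], ['baz', 'foo'], ['bar2']]
```

With drop_last=True the trailing ['bar2'] batch would simply be dropped.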
Regarding the type of container the output is stored in: namedtuple is a no-go since dynamic definition is a disaster. The choice is between tuples and dicts. I'm strongly in favor of tuples (so long as we preserve ordering like I outlined above). I don't see a clear need to return a mapping in most cases, and creating many thousands of dicts is actually a non-trivial expense if we don't need them. I think returning tuples by default, with an option to return a dict instead, would provide enough flexibility for most people.
2. Specifying Views / Column Layouts
The column layouts were created with the express purpose of allowing access/usage patterns which are wholly incompatible. There may be similarities and best practices, but at the end of the day we can't force a single-level store to be used in the same way a nested dictionary would be.
I'm going to propose we use a slightly modified version of the syntax used to retrieve a view of some samples / columns from the checkout object.
>>> dset[('nested_col', 'sample_1', '0'), ('foo', '0')]
[np.array([1, 0]), np.array([0])]
>>> dset[('nested_col', 'sample_1', ...), ('foo', '0')]
[{'subsample_0': np.array([1, 0]), 'subsample_1': np.array([1, 1])},
np.array([0])]
>>> dset[('foo', '0'), ('nested_col', 'foo', slice(0, 3))]
[np.array([0]),
{'subsample_0': np.array([1, 0]),
'subsample_1': np.array([1, 1]),
'subsample_2': np.array([2, 2])}]
But instead of having to specify the column every time, the sample keys for one view would be nested in a list and matched to their column by index. Something like this?
>>> make_numpy_dataset(column=('nested_col', 'foo'), keys=[(('sample_1', '0'), '0'),])
>>> make_numpy_dataset(
... column=('nested_col', 'foo'),
... keys=[(('sample_1', ...), '1'), (('sample_2', ...), '2')]
... )
# it's easy enough to generate keys this way as well.
>>> generated_keys = []
>>> for sample in train_col.keys():
... samp_key = ((sample, ...), sample, ('train_category', sample))
... generated_keys.append(samp_key)
>>>
>>> make_numpy_dataset(('nested_col', 'foo', 'nested_categories'), keys=generated_keys)
Still a bit of a WIP, but it'd be nice to be consistent across the platform and this seems like a reasonable solution.
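The positional matching in that proposal is mechanical enough to sketch. pair_keys is a hypothetical helper name; it just shows how each entry in a key tuple would be matched to its column by index:

```python
def pair_keys(columns, keys):
    """Match each entry of every key tuple to its column by position."""
    return [list(zip(columns, sample_keys)) for sample_keys in keys]

columns = ('nested_col', 'foo')
keys = [(('sample_1', '0'), '0'), (('sample_2', ...), '1')]

print(pair_keys(columns, keys)[0])
# [('nested_col', ('sample_1', '0')), ('foo', '0')]
```

Nested columns carry (sample, subsample) tuples or an Ellipsis for "all subsamples", while flat columns carry a bare sample key, mirroring the checkout view syntax.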
At any rate, I think we could knock this out in a short bit and push it to 0.5.1? Let me know your thoughts!
* New Hangar Quick Start Tutorial - download and organize Imagenette data
* Add Column initialization, Adding data and Committing changes section
* Fix validation data loading, add more comments
* Add Alessia as author
* Change name of the checkout object, add a note for larger dataset reading
* Update Tutorial-QuickStart.ipynb
* Fix paths and commit id
* Change title, minor fixes
* Fix typo
* Minor fixes

Co-authored-by: Rick Izzo <rlizzo@users.noreply.github.com>
* updated changelog
* Bump version: 0.5.1 → 0.5.2
Experimental decorator, documentation, type checking. Torch dataset can return OrderedDict or tuple. Guarding import exceptions for machine learning libraries that may not exist on the user's system. Refactorings and a few performance considerations to make working with batching keys easier.
I found that the commit history is somehow screwed up and not fixable without careful analysis, and that's probably not worth it. I am closing this PR in favor of creating #206.
Motivation and Context
A continuation of #62; should also solve #79.
Description
Types of changes
What types of changes does your code introduce? Put an x in all the boxes that apply:
Is this PR ready for review, or a work in progress?
How Has This Been Tested?
Put an x in the boxes that apply:
Checklist: