Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactoring of DataSource component for easier, future, extensibility #26

Open
RicardoJSandoval opened this issue Jul 28, 2022 · 10 comments
Assignees
Labels
enhancement New feature or request

Comments

@RicardoJSandoval
Copy link
Contributor

RicardoJSandoval commented Jul 28, 2022

Hi! I'm proposing a refactoring of folktables's DataSource component (refer to this ZIP to see the changes already implemented).

Proposed changes

In this section I describe the changes I'm proposing that will affect the structure of the DataSource component.

  1. Creating utilities that help to (down)load different datasets: many of the functionalities already coded in the current version of folktables can be reused to (down)load different datasets (e.g., making the HTTP request to download a dataset from the U.S. Census Bureau's website, saving the content of the HTTP request into local memory, ...). Hence, I'm proposing creating utilities for these common tasks as it will allow us to reuse the code in the case that more datasets will be supported.
  2. Added the ability to download files asynchronously.
  3. Changing the DataSource from an ABC to a Protocol: as I see it, the role of DataSource is to define an interface for classes that help with the (down)loading of different datasets. By having DataSource be a Protocol we'll be taking advantage of Python's duck typing and will make it more straight forward that this is the interface that should be followed (plus, we have the added benefit that we do not have to directly inherit from DataSource whenever defining this component for a different dataset).
  4. Introducing DownloadResource, LoadResource, and FilesResource: these are dataclasses that contain information pertaining to file paths and URLs for the different datasets the user wants to (down)load. These objects (note that FilesResource just contains instances of DownloadResource and LoadResource) makes it easier to keep track of the resources needed to (down)load files.
  5. Refactored load_acs.py and the ACSDataSource class in acs.py into one file: in this refactoring I'm following the re-defined DataSource interface, I'm using the utility functions I created or refactored, and I'm taking advantage of the *Resource objects defined. One additional thing I did was to turn the hard coded URL into a "constant" variable.
  6. Introducing custom exceptions.

Proposed file structure

folktables
|     __init__.py
|     folktables.py
|     exceptions.py
|     acs.py
|_____data_source
|     |     __init__.py
|     |     acs_data_source.py
|     |     data_source.py
|_____utils
|     |     __init__.py
|     |     download_utils.py
|     |     files_resource.py
|     |     load_utils.py
@RicardoJSandoval RicardoJSandoval changed the title Refactoring of DataSource component for easier, future, extensibility & introduction of SIPPDataSource Refactoring of DataSource component for easier, future, extensibility Jul 28, 2022
@mrtzh
Copy link
Member

mrtzh commented Jul 31, 2022

Hi Ricardo, this is a hugely valuable effort. Thanks so much!

I just had a look at the zip file that you link. This generally all looks good. There are a few higher level points to discuss and then some suggestions and minor comments.

Higher-level points:

There's a balance to strike between the generality of the code and the needs of the typical machine learning researcher. When changing the API, we should keep this in mind. A typical ML grad student should be able to create their own data sources quickly without too much of a learning curve. Otherwise instead of using folktables, researchers will go and write their own ad-hoc code.

Concrete suggestions:

  1. I probably would've skipped the extra layer of the FilesResource abstraction, collapsing download_utils.py and load_utils.py into one.
  2. It's nice to have asynchronous downloads. I wasn't able to tell though if this is optional from the perspective of the data source creator. Does this mean that someone who wants to instantiate DataSource(Protocol) will have to know what the async keyword in async def _download_data entails? If so, this might provide a bit of an extra hurdle to creating new data sources.
  3. Is it possible to break this up into a few different pull requests so that it's easier to review each individual component?

Minor comments:

  • files_resources.py has documentation that mention SIPP. I assume these should be general. Likewise acs_data_source.py has a few methods were SIPP shows up in the doc string.

Paging @francesding @millerjohnp @ludwigschmidt since this touches on some major design choices. Please let us know if you have any comments or concerns.

@mrtzh mrtzh added the enhancement New feature or request label Jul 31, 2022
@ludwigschmidt
Copy link
Collaborator

Hi Ricardo,

Thanks for integrating SIPP, it will be a great addition to folktables!

A few points below:

  • As Moritz said, it would be good to make this a real pull request (as opposed to linking to a zip file). That makes it easier to see the code changes in the Github UI, comment on individual lines, etc.

  • Sharing code between the ACS download routines and SIPP is great. What I don't fully understand right now is how a future contributor should use these building blocks. For instance, what is the purpose of _load_data and _download_data in the DataSource interface? If these functions can be called from code external to DataSource, why do the names start with an underscore? If the functions are only for use internal to DataSource implementations, why are they part of the interface?

  • Asynchronous downloads are great to increase download speed if all the resulting complexity is handled in the DataSource API. I think users of DataSource should not have to know about asynchronous Python since it can be a bit complicated.

  • The utils folder should contain only code that is independent of a specific data source. The documentation string in files_resources.py says "Stores information needed to (down)load the SIPP dataset." Should this be in one of the SIPP files?

  • Re-using code is good, but there is also a cost to creating abstractions. For instance, I'm not sure if we need the load_json function in load_utils.py. The amount of code we save is only one line, and the two lines in the function for loading a JSON are relatively easy to understand. Introducing a new function makes the code harder to read for others who don't know exactly what the function is doing but would understand the two common Python lines for loading a JSON.

@millerjohnp
Copy link
Collaborator

Hi, thanks for the proposal and initial changes! Beyond what others have mentioned, just want to understand if this is targeting a new dataset or collection of datasets? Or is there a particular part of the current interface you find confusing or annoying?

I think it'd be easier to reason about the costs/benefits of interface changes if there's actually a concrete target. Additionally, I like the lightweight interface, and I think it only really makes sense to add a lot of additional complexity and spend dev time implementing and maintaining something it if we're actually getting a new resource or improving quality of life for users.

@RicardoJSandoval
Copy link
Contributor Author

RicardoJSandoval commented Aug 1, 2022

Hi all,

Thanks for all the feedback and comments. I'll address all of them here.

Feedback

Purpose of the proposed changes

The main idea behind the changes I've made is to keep the user facing API intact while changing the way in which the implementation is done. This is in order to make it easier to integrate other datasets in the future. Hence, the example you all list in the README would still be valid:

data_source = ACSDataSource(survey_year='2018', horizon='1-Year', survey='person')
acs_data = data_source.get_data(states=["CA"], download=True)

The way to use the SIPPDataSource is very similar, the only difference being the parameters you have to pass to the constructor and to get_data.

I started to work on these changes when I started implementing the functionality to (down)load the SIPP dataset. I found a lot of the existing code to be useful but couldn't access it for my purposes as it was tailored for the ACS dataset. Hence, I went ahead and refactored the existing code to make it more modular and reusable.

FilesResource

You all are right; SIPP shouldn't be mentioned at all in files_resource.py as FilesResource is agnostic of any dataset/datasource. Rather, it's just supposed to contain information that till be useful to (down)load any dataset. I'll go ahead and remove any mention of the SIPP dataset from this file.

I originally thought of having the DownloadResource and LoadResource abstractions as they help separate the behavior of the core functionalities the DataSource provides. However, I don’t mind getting rid of them and just assigning all the fields to FilesResource.

Asynchronous downloads

Yes, the idea is that all asynchronous downloads are handled internally and that the implementation details are hidden from the user. You can use the ACSDataSource without having to worry about what async and await entails. This comment made me rethink the design and I think it would probably be best to keep the DataSource objects free from any async code and push it all (if we end up deciding to do so) to utils.download_utils.py.

There is, however, a caveat with asyncio’s to_thread coroutine: it was introduced in Python's version 3.9, which means that it might limit some backwards compatibility.

DataSource’s _load_data and _download_data

My idea of including _load_data and _download_data in DataSource was to make it explicit that these are the two core functionalities that DataSource objects provide (and it also made it easier for me to divide the code into these functions). However, we can certainly remove them from this protocol class as they don’t really add much.

load_json

This is a good point, Ludwig. I can go ahead and remove this function and directly use the JSON loading code in the rest of the code.

Moving forward

Following what Moritz and Ludwig proposed, I’ll go ahead and break this up into a couple of pull requests:

  1. I’ll start by making a pull request for the utils directory as I think this is the easiest to reason about, since it doesn’t depend on any other code in folktables and it’s the building block for future pull requests (it also won’t affect any of the code you all currently have).
  2. Once you’ve all decided on the utils directory, I can go ahead and push the code for DataSource and the refactored ACSDataSource.
  3. Lastly, I can push the code for the SIPPDataSource.

Please let me know what you all think about this plan and if you all have any other questions!

@mrtzh
Copy link
Member

mrtzh commented Aug 1, 2022

SGTM - Thanks so much!

@millerjohnp
Copy link
Collaborator

Sounds great to me! Re: asynchronous downloads, it might be nice to maintain some flexibility with python versions. concurrent.futures might be one alternative to asyncio

@ludwigschmidt
Copy link
Collaborator

Regarding Python versions: apart from the async part, the switch from ABC to Protocol also requires at least Python 3.8 because Protocol was introduced then. I'm not sure what our policy on Python versions should be. According to our unit tests, we currently support 3.7 onwards. Python 3.8 was released in October 2019, so it may well be that it's widely used by now and doesn't cause trouble for users of folktables. Any thoughts?

@ludwigschmidt
Copy link
Collaborator

Also just to clarify my comment regarding _load_data and _download_data. If we have a use case where code external to DataSource needs to call these functions, we can certainly introduce them to the interface. Based on my understanding of the code right now, the two functions are only called by DataSource objects internally. If that's the case, we don't need to add the functions to the externally visible interface. As John mentioned, it's nice to have simple interfaces and hide the complexity inside implementations. But I may also just be misunderstanding the re-organized code right now :-)

If we have a use case for extending the DataSource interface to include the two new functions, it would be good to rename the functions and remove the leading underscore. Python convention is to use leading underscores for internal function names, e.g., see https://stackoverflow.com/questions/1301346/what-is-the-meaning-of-single-and-double-underscore-before-an-object-name .

Thank you for making the changes!

@RicardoJSandoval
Copy link
Contributor Author

@ludwigschmidt You are right, my idea for the _load_data and _download_data methods is for them to only be called internally by DataSource objects. I'll go ahead and remove them from the externally visible interface.

I'll also go ahead and revert back to using ABC instead of protocol in order to still be able to support Python 3.7, and will switch from using asyncio to concurrent.futures.

I'll go ahead and send the first pull request shortly.

@francesding
Copy link
Collaborator

francesding commented Aug 2, 2022 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

5 participants