Criteo dataloader #642

Merged
miguelgfierro merged 38 commits into staging from jeremr-criteo-dataloader
Mar 19, 2019

miguelgfierro
Collaborator

Description

Criteo data loader in Python and Spark
Smoke and integration tests

Related Issues

#555

Checklist:

  • My code follows the code style of this project, as detailed in our contribution guidelines.
  • I have added tests.
  • I have updated the documentation accordingly.

Contributor

@motefly motefly left a comment

Looks great.

Collaborator

@gramhagen gramhagen left a comment

sorry that you have to bear the brunt of my grumpiness with how the data loading is set up =)
we can simplify this section and avoid creating unnecessary single-use functions or convenience classes that make the code more brittle to changes in dependencies. this will also limit the test surface.

reco_utils/common/python_utils.py (outdated, resolved)
reco_utils/dataset/criteo_dac.py (outdated, resolved)
reco_utils/dataset/criteo_dac.py (outdated, resolved)
reco_utils/dataset/criteo_dac.py (outdated, resolved)
reco_utils/dataset/criteo_dac.py (outdated, resolved)
reco_utils/dataset/criteo_dac.py (outdated, resolved)
reco_utils/dataset/movielens.py (outdated, resolved)
reco_utils/dataset/url_utils.py (resolved)
reco_utils/dataset/criteo_dac.py (outdated, resolved)
Collaborator

@loomlike loomlike left a comment

Check my comment about potential misbehavior that might be caused by removing the clean-up code.

reco_utils/dataset/criteo_dac.py (resolved)


@contextmanager
def _real_path(path):
Collaborator Author

@loomlike we should apply DRY here. However, this code is a little different from what you have in movielens, and I'm not sure how to homogenize the two.

Collaborator

I got it. What about this:

  1. Put this real_path in url_utils (by the way, we talked about renaming that module to dataset_utils, right?)
  2. Use that function from criteo.py
  3. Refactor movielens to use the same function, probably by checking the "zip" file name outside of this function. Either you or I can do this refactoring.
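
For illustration only, a shared helper along these lines might look like the sketch below. The name `real_path`, the convention that `None` means "use a temporary directory", and the clean-up on exit are all assumptions made for this example; the sketch is not the code under review in this PR.

```python
import os
import shutil
import tempfile
from contextlib import contextmanager


@contextmanager
def real_path(path=None):
    """Yield a directory to work in (hypothetical shared helper).

    Assumed behavior, for illustration:
    - path is None: create a temporary directory and remove it on exit.
    - path is given: resolve it, create it if missing, and leave it in place.
    """
    if path is None:
        tmp_dir = tempfile.mkdtemp()
        try:
            yield tmp_dir
        finally:
            # Clean up even if the caller's block raised an exception.
            shutil.rmtree(tmp_dir, ignore_errors=True)
    else:
        abs_path = os.path.realpath(path)
        os.makedirs(abs_path, exist_ok=True)
        yield abs_path
```

A dataset loader could then wrap its download step in `with real_path(local_cache_path) as p:` (where `local_cache_path` is a placeholder argument name), and any check on the archive file name would stay in the caller rather than inside the helper.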

Collaborator

this looks reasonable to me

Collaborator Author

Sounds good. @loomlike if you are free today, feel free to push directly to the branch. If not, I'll do it tomorrow.

Collaborator Author

@loomlike I added an issue for the renaming #662

Collaborator

@gramhagen gramhagen left a comment

cool, let's merge this!



@miguelgfierro miguelgfierro merged commit df96631 into staging Mar 19, 2019
@miguelgfierro miguelgfierro deleted the jeremr-criteo-dataloader branch March 19, 2019 22:02
yueguoguo pushed a commit that referenced this pull request Sep 9, 2019
6 participants