Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding Movielens 100k tabular dataset to basicdatasets #1718

Merged
merged 4 commits into from
Jun 16, 2022

Conversation

siddvenk
Copy link
Contributor

Description

Adding the Movielens 100k dataset to the tabular datasets. This dataset can be used to train regression models, collaborative filtering models, Graph Neural Networks, and a host of other recommendation system models.

https://grouplens.org/datasets/movielens/100k/

Note

The tests right now are using a local instace of the repo since the metadata hasn't been published to S3. I'll update the tests accordingly once the metadata.json has been pushed to S3 and I can verify using the remote bucket works.

@siddvenk
Copy link
Contributor Author

Ubuntu/Windows builds are failing because the associated unit test is failing. This is expected with the way I hard coded the path for the repository used in the test. Seems like there are 2 ways I can resolve this:

  1. Use a relative path in the repo - test will still use a local repo but relative path should at least make finding the files work
  2. Use the default repo that can only work once the metatdata has been added to the djl repo.

I see examples of both being used for the basicdataset tests - do we have a preference?

@lanking520
Copy link
Contributor

Not an expertise in this part, but I think you can try to follow: #1667 this PR to see how they do the test. @zachgk any thoughts?

@siddvenk
Copy link
Contributor Author

@lanking520 I was thinking of doing something similar to that. I wonder whether it makes sense to standardize across the basicdataset tests? Tests for Cifar10, Mnist, ImageNet, Wiki2Text are using a local repo in the tests, while others are using djl repo from what I can tell (they're not providing a repo when building, so it will use the default djl repo).

Using the djl repo in the tests seems better to me to validate the data is stored correctly, configuration in the code is correct, and the data can be pulled to be built as a dataset. With local repo we can avoid some calls to the repo to fetch the metadata and files, but since the files are cached and don't change often I'm not sure if there's much benefit using a local repo in the tests.

@codecov-commenter
Copy link

codecov-commenter commented Jun 16, 2022

Codecov Report

Merging #1718 (e01cc90) into master (bb5073f) will decrease coverage by 1.54%.
The diff coverage is 70.71%.

@@             Coverage Diff              @@
##             master    #1718      +/-   ##
============================================
- Coverage     72.08%   70.54%   -1.55%     
- Complexity     5126     5538     +412     
============================================
  Files           473      525      +52     
  Lines         21970    24617    +2647     
  Branches       2351     2662     +311     
============================================
+ Hits          15838    17367    +1529     
- Misses         4925     5928    +1003     
- Partials       1207     1322     +115     
Impacted Files Coverage Δ
api/src/main/java/ai/djl/modality/cv/Image.java 69.23% <ø> (-4.11%) ⬇️
...rc/main/java/ai/djl/modality/cv/output/Joints.java 71.42% <ø> (ø)
.../main/java/ai/djl/modality/cv/output/Landmark.java 100.00% <ø> (ø)
...main/java/ai/djl/modality/cv/output/Rectangle.java 72.41% <0.00%> (ø)
...i/djl/modality/cv/translator/BigGANTranslator.java 21.42% <ø> (-5.24%) ⬇️
...odality/cv/translator/BigGANTranslatorFactory.java 33.33% <0.00%> (+8.33%) ⬆️
...nslator/InstanceSegmentationTranslatorFactory.java 14.28% <0.00%> (-3.90%) ⬇️
.../ai/djl/modality/cv/translator/YoloTranslator.java 8.33% <0.00%> (-0.50%) ⬇️
.../modality/cv/translator/YoloTranslatorFactory.java 8.33% <0.00%> (-1.67%) ⬇️
...i/djl/modality/cv/translator/YoloV5Translator.java 5.38% <0.00%> (-0.31%) ⬇️
... and 287 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 95ffc30...e01cc90. Read the comment docs.

Copy link
Contributor

@zachgk zachgk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I modified the metadata to add a name to the license. That way, if we show the license name somewhere it would at least have some reasonable information conveyed.

The metadata is uploaded so you can modify the test to pull from DJL central rather local now

@zachgk zachgk merged commit 6b448fc into deepjavalibrary:master Jun 16, 2022
@siddvenk siddvenk deleted the movielens-100k branch June 16, 2022 18:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants