
Fix notebook build failure on Spark 3.2 #1608

Merged · 4 commits merged into staging from simonz/build-failure-on-spark3.2 on Jan 18, 2022

Conversation

@simonzhaoms (Collaborator) commented on Jan 13, 2022

Description

Spark 3.2 introduces breaking changes that make code written for Spark versions below 3.2 incompatible. This PR modifies the notebook to be compatible with Spark 3.2. The affected dependency is MMLSpark, which has been renamed to SynapseML.
However, the latest SynapseML release (0.9.5) does not support Spark versions below 3.2; to use Spark 3.0 or 3.1, SynapseML 0.9.4 should be used instead. This difference is also noted in the comments of mmlspark_lightgbm_criteo.ipynb, and a sketch of the version switch is shown below.
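
As a rough sketch (not necessarily the exact code in this PR), the version switch could look like the following in PySpark. The Maven coordinates and repository URL are assumptions based on the published SynapseML artifacts:

```python
import pyspark
from pyspark.sql import SparkSession

# Pick the SynapseML release that matches the running Spark version:
# SynapseML 0.9.5 targets Spark 3.2+, while 0.9.4 is the release to use
# on Spark 3.0/3.1.
spark_version = tuple(int(x) for x in pyspark.__version__.split(".")[:2])
synapseml_version = "0.9.5" if spark_version >= (3, 2) else "0.9.4"

spark = (
    SparkSession.builder.appName("mmlspark_lightgbm_criteo")
    .config(
        "spark.jars.packages",
        f"com.microsoft.azure:synapseml_2.12:{synapseml_version}",
    )
    .config("spark.jars.repositories", "https://mmlspark.azureedge.net/maven")
    .getOrCreate()
)
```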

Related Issues

See #1553

Checklist:

  • I have followed the contribution guidelines and code style for this project.
  • I have added tests covering my contributions.
  • I have updated the documentation accordingly.
  • This PR is being made to staging branch and not to main branch.

@review-notebook-app commented: Check out this pull request on ReviewNB to see visual diffs and provide feedback on the Jupyter notebooks.

@simonzhaoms (Collaborator, Author) commented on Jan 13, 2022

@miguelgfierro This PR looks good now. The error below appears when running on Spark 3.2. Following the instructions in the error message, I set spark.sql.analyzer.failAmbiguousSelfJoin to false to disable the check, but I am not sure whether the check reveals a deeper issue that should be resolved. Please take a look.

AnalysisException:  Column MovieId#1001 are ambiguous. It's probably because you joined several Datasets
together, and some of these Datasets are the same. This column points to one of the Datasets but Spark is
unable to figure out which one. Please alias the Datasets with different names via `Dataset.as` before
joining them, and specify the column using qualified name, e.g.
`df.as("a").join(df.as("b"), $"a.id" > $"b.id")`.  You can also set
spark.sql.analyzer.failAmbiguousSelfJoin to false to disable this check.

The error can be found here.
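
For context, here is a minimal sketch of the two options the error message offers, assuming a running SparkSession `spark` and a hypothetical DataFrame `df` with `MovieId` and `Rating` columns:

```python
from pyspark.sql import functions as F

# Option 1 (the workaround used in this PR): disable the ambiguity check.
spark.conf.set("spark.sql.analyzer.failAmbiguousSelfJoin", "false")

# Option 2 (the fix the error message recommends): alias both sides of the
# self-join and qualify columns through the aliases, so Spark can tell the
# two column lineages apart.
joined = (
    df.alias("a")
    .join(df.alias("b"), F.col("a.MovieId") == F.col("b.MovieId"))
    .select(F.col("a.MovieId"), F.col("b.Rating"))
)
```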

@miguelgfierro (Collaborator) left a comment:

looks good to me

@codecov-commenter commented on Jan 18, 2022

Codecov Report

Merging #1608 (0d23856) into staging (1932d2a) will increase coverage by 58.22%.
The diff coverage is 100.00%.

❗ Current head 0d23856 differs from pull request most recent head 6edbdd2. Consider uploading reports for the commit 6edbdd2 to get more accurate results.

@@             Coverage Diff              @@
##           staging    #1608       +/-   ##
============================================
+ Coverage     0.00%   58.22%   +58.22%     
============================================
  Files           84       84               
  Lines         8462     8462               
============================================
+ Hits             0     4927     +4927     
- Misses           0     3535     +3535     
Flag      Coverage Δ
nightly   ?
pr-gate   58.22% <100.00%> (?)

Flags with carried forward coverage won't be shown.

Impacted Files                                 Coverage Δ
recommenders/utils/spark_utils.py              96.15% <100.00%> (+96.15%) ⬆️
recommenders/datasets/mind.py                   0.00% <0.00%> (ø)
recommenders/datasets/movielens.py             69.46% <0.00%> (+69.46%) ⬆️
recommenders/datasets/download_utils.py        90.00% <0.00%> (+90.00%) ⬆️
recommenders/models/newsrec/models/npa.py      95.58% <0.00%> (+95.58%) ⬆️
recommenders/models/newsrec/models/naml.py     92.43% <0.00%> (+92.43%) ⬆️
recommenders/models/newsrec/models/nrms.py     91.37% <0.00%> (+91.37%) ⬆️
recommenders/evaluation/spark_evaluation.py    86.60% <0.00%> (+86.60%) ⬆️
recommenders/models/newsrec/models/lstur.py    87.14% <0.00%> (+87.14%) ⬆️
recommenders/evaluation/python_evaluation.py   93.68% <0.00%> (+93.68%) ⬆️
... and 11 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1932d2a...6edbdd2. Read the comment docs.

@miguelgfierro (Collaborator) left a comment:

Great!

@miguelgfierro miguelgfierro merged commit 2c3feac into staging Jan 18, 2022
@miguelgfierro miguelgfierro deleted the simonz/build-failure-on-spark3.2 branch January 18, 2022 15:27