Upgrade to Spark v3 #1543

Merged
anargyri merged 14 commits into staging from andreas/spark3 on Oct 11, 2021

Conversation

@anargyri (Collaborator) commented Oct 6, 2021

Description

Changes required to support Spark version 3.
Enabled Python 3.8 because it is required by Databricks.

Related Issues

Checklist:

  • I have followed the contribution guidelines and code style for this project.
  • I have added tests covering my contributions.
  • I have updated the documentation accordingly.
  • This PR is being made to the staging branch and not to the main branch.

@review-notebook-app
Check out this pull request on ReviewNB.
See visual diffs & provide feedback on Jupyter Notebooks.

@laserprec (Contributor) left a comment

Looks good! Wondering if we have tested for 3.8 support as well?

"pyarrow>=0.8.0,<1.0.0",
"pyspark>=2.4.5,<3.0.0",
"pyarrow>=0.12.1,<6.0.0",
"pyspark>=2.4.5,<4.0.0",
Collaborator

One question here: if a user has PySpark 2 and installs the current notebooks with the current version of MMLSpark, will it work?

Collaborator Author

The databricks_install script will work because it looks up the jar based on the Spark version.
The notebooks that use the MMLSPARK variables from spark_utils.py will have to be changed, but it is a reasonably small change for the end user IMO. Perhaps we could have a check of the Spark version in the notebooks too.
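For illustration only, such a check could look roughly like the notebook cell below; this is just a sketch, and the Maven coordinates and repo URLs are placeholders, not the actual values defined in spark_utils.py:

# Sketch only: pick MMLSpark coordinates based on the installed Spark major version.
# The coordinate and repo strings below are placeholders, not the repo's actual values.
import pyspark

spark_major = int(pyspark.__version__.split(".")[0])
if spark_major >= 3:
    mmlspark_package = "<mmlspark-maven-coordinates-for-spark-3>"  # placeholder
    mmlspark_repo = "<maven-repo-url-for-spark-3>"                 # placeholder
else:
    mmlspark_package = "<mmlspark-maven-coordinates-for-spark-2>"  # placeholder
    mmlspark_repo = "<maven-repo-url-for-spark-2>"                 # placeholder
print("Using {} from {}".format(mmlspark_package, mmlspark_repo))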

Collaborator

I guess for users that are on 2.x it is not going to be easy to find the exact package and repo information. How can we notify them: should we have a note in the notebook or a note in spark_utils.py? What do you think?

Collaborator Author

Maybe add a couple of commented-out lines in the notebook (which they can uncomment in case of v2) with the explicit mmlspark info for v2? Just for their convenience, since we support only v3.

Collaborator

yeah makes sense

Collaborator Author

I included the old info in spark_utils.py in the end.
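As a rough illustration of what keeping both sets of info could look like (the constant names and values below are placeholders, not the actual contents of spark_utils.py):

# Sketch only: keep the Spark 3 (default) and legacy Spark 2 MMLSpark info as
# module-level constants; names and values here are placeholders.
MMLSPARK_PACKAGE = "<mmlspark-maven-coordinates-for-spark-3>"  # default, Spark 3
MMLSPARK_REPO = "<maven-repo-url-for-spark-3>"
# Old info kept for users still on pyspark 2.x:
MMLSPARK_PACKAGE_SPARK2 = "<mmlspark-maven-coordinates-for-spark-2>"
MMLSPARK_REPO_SPARK2 = "<maven-repo-url-for-spark-2>"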

@@ -118,5 +118,5 @@
     install_requires=install_requires,
     package_dir={"recommenders": "recommenders"},
     packages=find_packages(where=".", exclude=["tests", "tools", "examples"]),
-    python_requires=">=3.6, <3.8",
+    python_requires=">=3.6, <3.9",  # latest Databricks versions come with Python 3.8 installed
Collaborator

Do we know if the whole repo works with 3.8?

Collaborator Author

I could check.

@anargyri (Collaborator Author) commented Oct 7, 2021

> Looks good! Wondering if we have tested for 3.8 support as well?

I need to check.

"pyarrow>=0.8.0,<1.0.0",
"pyspark>=2.4.5,<3.0.0",
"pyarrow>=0.12.1,<6.0.0",
"pyspark>=2.4.5,<4.0.0",
Collaborator

yeah makes sense

@anargyri (Collaborator Author) commented Oct 8, 2021

So, Python 3.8 doesn't work. The [gpu] dependencies cannot be installed with pip (I tried it in a conda env); sometimes it complains about pytorch, sometimes about tensorflow.
We should leave the Python 3.8 upgrade for another PR.
This makes sense because TF v1.15 doesn't support Python 3.8, see https://www.tensorflow.org/install/source#gpu

@anargyri (Collaborator Author) commented Oct 8, 2021

I have also tested with Java 11 (i.e. keeping the JAVA env variable pointing to the Java 11 home) and the tests pass.
I will amend the text to require Java >=8.

@codecov-commenter

Codecov Report

Merging #1543 (8d878aa) into staging (e8fd200) will not change coverage.
The diff coverage is 100.00%.


@@           Coverage Diff            @@
##           staging    #1543   +/-   ##
========================================
  Coverage    62.07%   62.07%           
========================================
  Files           84       84           
  Lines         8492     8492           
========================================
  Hits          5271     5271           
  Misses        3221     3221           
Impacted Files | Coverage Δ
recommenders/utils/spark_utils.py | 96.15% <100.00%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 2eccb87...8d878aa.

@anargyri merged commit a485784 into staging on Oct 11, 2021
@anargyri deleted the andreas/spark3 branch on October 11, 2021 at 14:39