
Large refactor #1086

Merged: miguelgfierro merged 67 commits into staging from miguel/burn_and_destroy on Jun 29, 2020

Conversation

@miguelgfierro (Collaborator)

Description

Refactoring repo

Related Issues

#810

Checklist:

  • I have followed the contribution guidelines and code style for this project.
  • I have added tests covering my contributions.
  • I have updated the documentation accordingly.
  • This PR is being made to staging and not master.

@yueguoguo (Collaborator) left a comment

LGTM. BURN!

Review thread on scenarios/COLD_START.md (outdated, resolved)
@miguelgfierro (Collaborator, Author) commented on May 21, 2020

TODO:

  • All general parts live in examples; for instance, when we discuss the metrics, we link to the notebook in the examples folder that explains them
  • Finish retail.md and review it again with the team

Review thread on scenarios/retail/README.md (outdated, resolved)
@miguelgfierro (Collaborator, Author) commented on Jun 22, 2020

with "coordinates": "Azure:mmlspark:0.17"
notebook: examples/02_model_content_based_filtering/mmlspark_lightgbm_criteo.ipynb

pytest tests/smoke -m "smoke and spark and not gpu" --durations 0 --disable-warnings
========================================================================= short test summary info =========================================================================
FAILED tests/smoke/test_notebooks_pyspark.py::test_mmlspark_lightgbm_criteo_smoke - papermill.exceptions.PapermillExecutionError:
=================================================== 1 failed, 3 passed, 26 deselected, 53 warnings in 100.11s (0:01:40) ===================================================



    def test_mmlspark_lightgbm_criteo_smoke(notebooks):
        notebook_path = notebooks["mmlspark_lightgbm_criteo"]
        pm.execute_notebook(
            notebook_path,
            OUTPUT_NOTEBOOK,
            kernel_name=KERNEL_NAME,
>           parameters=dict(DATA_SIZE="sample", NUM_ITERATIONS=50, EARLY_STOPPING_ROUND=10),
        )

tests/smoke/test_notebooks_pyspark.py:46:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
/anaconda/envs/reco_full/lib/python3.6/site-packages/papermill/execute.py:100: in execute_notebook
    raise_for_execution_errors(nb, output_path)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

nb = {'cells': [{'cell_type': 'code', 'metadata': {'inputHidden': True, 'hide_input': True}, 'execution_count': None, 'sour...nd_time': '2020-06-22T10:29:38.745547', 'duration': 33.198558, 'exception': True}}, 'nbformat': 4, 'nbformat_minor': 2}
output_path = 'output.ipynb'

    def raise_for_execution_errors(nb, output_path):
        """Assigned parameters into the appropriate place in the input notebook

        Parameters
        ----------
        nb : NotebookNode
           Executable notebook object
        output_path : str
           Path to write executed notebook
        """
        error = None
        for cell in nb.cells:
            if cell.get("outputs") is None:
                continue

            for output in cell.outputs:
                if output.output_type == "error":
                    error = PapermillExecutionError(
                        exec_count=cell.execution_count,
                        source=cell.source,
                        ename=output.ename,
                        evalue=output.evalue,
                        traceback=output.traceback,
                    )
                    break

        if error:
            # Write notebook back out with the Error Message at the top of the Notebook.
            error_msg = ERROR_MESSAGE_TEMPLATE % str(error.exec_count)
            error_msg_cell = nbformat.v4.new_code_cell(
                source="%%html\n" + error_msg,
                outputs=[
                    nbformat.v4.new_output(output_type="display_data", data={"text/html": error_msg})
                ],
                metadata={"inputHidden": True, "hide_input": True},
            )
            nb.cells = [error_msg_cell] + nb.cells
            write_ipynb(nb, output_path)
>           raise error
E           papermill.exceptions.PapermillExecutionError:
E           ---------------------------------------------------------------------------
E           Exception encountered at "In [9]":
E           ---------------------------------------------------------------------------
E           Py4JJavaError                             Traceback (most recent call last)
E           <ipython-input-9-2f94d6c0254d> in <module>
E           ----> 1 model = lgbm.fit(train)
E                 2 predictions = model.transform(test)
E
E           /anaconda/envs/reco_full/lib/python3.6/site-packages/pyspark/ml/base.py in fit(self, dataset, params)
E               130                 return self.copy(params)._fit(dataset)
E               131             else:
E           --> 132                 return self._fit(dataset)
E               133         else:
E               134             raise ValueError("Params must be either a param map or a list/tuple of param maps, "
E
E           /anaconda/envs/reco_full/lib/python3.6/site-packages/pyspark/ml/wrapper.py in _fit(self, dataset)
E               293
E               294     def _fit(self, dataset):
E           --> 295         java_model = self._fit_java(dataset)
E               296         model = self._create_model(java_model)
E               297         return self._copyValues(model)
E
E           /anaconda/envs/reco_full/lib/python3.6/site-packages/pyspark/ml/wrapper.py in _fit_java(self, dataset)
E               289         :return: fitted Java model
E               290         """
E           --> 291         self._transfer_params_to_java()
E               292         return self._java_obj.fit(dataset._jdf)
E               293
E
E           /anaconda/envs/reco_full/lib/python3.6/site-packages/pyspark/ml/wrapper.py in _transfer_params_to_java(self)
E               122         for param in self.params:
E               123             if self.isSet(param):
E           --> 124                 pair = self._make_java_param_pair(param, self._paramMap[param])
E               125                 self._java_obj.set(pair)
E               126             if self.hasDefault(param):
E
E           /anaconda/envs/reco_full/lib/python3.6/site-packages/pyspark/ml/wrapper.py in _make_java_param_pair(self, param, value)
E               111         sc = SparkContext._active_spark_context
E               112         param = self._resolveParam(param)
E           --> 113         java_param = self._java_obj.getParam(param.name)
E               114         java_value = _py2java(sc, value)
E               115         return java_param.w(java_value)
E
E           /anaconda/envs/reco_full/lib/python3.6/site-packages/py4j/java_gateway.py in __call__(self, *args)
E              1255         answer = self.gateway_client.send_command(command)
E              1256         return_value = get_return_value(
E           -> 1257             answer, self.gateway_client, self.target_id, self.name)
E              1258
E              1259         for temp_arg in temp_args:
E
E           /anaconda/envs/reco_full/lib/python3.6/site-packages/pyspark/sql/utils.py in deco(*a, **kw)
E                61     def deco(*a, **kw):
E                62         try:
E           ---> 63             return f(*a, **kw)
E                64         except py4j.protocol.Py4JJavaError as e:
E                65             s = e.java_exception.toString()
E
E           /anaconda/envs/reco_full/lib/python3.6/site-packages/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
E               326                 raise Py4JJavaError(
E               327                     "An error occurred while calling {0}{1}{2}.\n".
E           --> 328                     format(target_id, ".", name), value)
E               329             else:
E               330                 raise Py4JError(
E
E           Py4JJavaError: An error occurred while calling o106.getParam.
E           : java.util.NoSuchElementException: Param boostFromAverage does not exist.
E               at org.apache.spark.ml.param.Params$$anonfun$getParam$2.apply(params.scala:729)
E               at org.apache.spark.ml.param.Params$$anonfun$getParam$2.apply(params.scala:729)
E               at scala.Option.getOrElse(Option.scala:121)
E               at org.apache.spark.ml.param.Params$class.getParam(params.scala:728)
E               at org.apache.spark.ml.PipelineStage.getParam(Pipeline.scala:42)
E               at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
E               at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
E               at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
E               at java.lang.reflect.Method.invoke(Method.java:498)
E               at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
E               at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
E               at py4j.Gateway.invoke(Gateway.java:282)
E               at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
E               at py4j.commands.CallCommand.execute(CallCommand.java:79)
E               at py4j.GatewayConnection.run(GatewayConnection.java:238)
E               at java.lang.Thread.run(Thread.java:748)

/anaconda/envs/reco_full/lib/python3.6/site-packages/papermill/execute.py:248: PapermillExecutionError

The error "Param boostFromAverage does not exist" is weird: looking at the signature of LightGBMClassifier, it does have the parameter boostFromAverage=True.
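
A way to see what is going on (a diagnostic sketch using only standard pyspark.ml Params methods; lgbm is the LightGBMClassifier instance from the notebook): list every parameter name the Python wrapper can transfer to the JVM in _transfer_params_to_java. Any name the Java object does not know, like boostFromAverage here, fails with exactly this NoSuchElementException during fit().

    # Diagnostic sketch: print the param names the wrapper may push to the JVM.
    # Params.params, isSet() and hasDefault() are standard pyspark.ml methods;
    # each printed name must exist on the Java object or getParam() will fail.
    for p in lgbm.params:
        if lgbm.isSet(p) or lgbm.hasDefault(p):
            print(p.name)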

The Maven repository has changed from https://mvnrepository.com/artifact/Azure/mmlspark?repo=spark-packages to https://mvnrepository.com/artifact/com.microsoft.ml.spark/mmlspark. If I change "coordinates": "Azure:mmlspark:0.17" to "coordinates": "com.microsoft.ml.spark:mmlspark_2.11:0.18.1", I still get an error:

from mmlspark.train import ComputeModelStatistics
from mmlspark.lightgbm import LightGBMClassifier

model = lgbm.fit(train)


---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
<ipython-input-11-2c2d97ba8c1c> in <module>
----> 1 model = lgbm.fit(train)

/anaconda/envs/reco_full/lib/python3.6/site-packages/pyspark/ml/base.py in fit(self, dataset, params)
    130                 return self.copy(params)._fit(dataset)
    131             else:
--> 132                 return self._fit(dataset)
    133         else:
    134             raise ValueError("Params must be either a param map or a list/tuple of param maps, "

/anaconda/envs/reco_full/lib/python3.6/site-packages/pyspark/ml/wrapper.py in _fit(self, dataset)
    293 
    294     def _fit(self, dataset):
--> 295         java_model = self._fit_java(dataset)
    296         model = self._create_model(java_model)
    297         return self._copyValues(model)

/anaconda/envs/reco_full/lib/python3.6/site-packages/pyspark/ml/wrapper.py in _fit_java(self, dataset)
    290         """
    291         self._transfer_params_to_java()
--> 292         return self._java_obj.fit(dataset._jdf)
    293 
    294     def _fit(self, dataset):

/anaconda/envs/reco_full/lib/python3.6/site-packages/py4j/java_gateway.py in __call__(self, *args)
   1255         answer = self.gateway_client.send_command(command)
   1256         return_value = get_return_value(
-> 1257             answer, self.gateway_client, self.target_id, self.name)
   1258 
   1259         for temp_arg in temp_args:

/anaconda/envs/reco_full/lib/python3.6/site-packages/pyspark/sql/utils.py in deco(*a, **kw)
     61     def deco(*a, **kw):
     62         try:
---> 63             return f(*a, **kw)
     64         except py4j.protocol.Py4JJavaError as e:
     65             s = e.java_exception.toString()

/anaconda/envs/reco_full/lib/python3.6/site-packages/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    326                 raise Py4JJavaError(
    327                     "An error occurred while calling {0}{1}{2}.\n".
--> 328                     format(target_id, ".", name), value)
    329             else:
    330                 raise Py4JError(

Py4JJavaError: An error occurred while calling o115.fit.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 6.0 failed 1 times, most recent failure: Lost task 0.0 in stage 6.0 (TID 21, localhost, executor driver): java.lang.UnsatisfiedLinkError: com.microsoft.ml.lightgbm.lightgbmlibJNI.LGBM_NetworkFree()I
	at com.microsoft.ml.lightgbm.lightgbmlibJNI.LGBM_NetworkFree(Native Method)
	at com.microsoft.ml.lightgbm.lightgbmlib.LGBM_NetworkFree(lightgbmlib.java:209)
	at com.microsoft.ml.spark.lightgbm.TrainUtils$.trainLightGBM(TrainUtils.scala:415)
	at com.microsoft.ml.spark.lightgbm.LightGBMBase$$anonfun$6.apply(LightGBMBase.scala:85)
	at com.microsoft.ml.spark.lightgbm.LightGBMBase$$anonfun$6.apply(LightGBMBase.scala:85)
	at org.apache.spark.sql.execution.MapPartitionsExec$$anonfun$5.apply(objects.scala:188)
	at org.apache.spark.sql.execution.MapPartitionsExec$$anonfun$5.apply(objects.scala:185)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)

TODO: hint from @gramhagen: try downgrading to Spark 2.4.3 and try again.

SOLUTION: use mmlspark 0.18.1 with Spark 2.4.3 and pyspark 2.4.3 (Spark was downloaded from https://archive.apache.org/dist/spark/spark-2.4.3/ and added to /dsvm/tools/spark):

cd /dsvm/tools/spark
sudo wget https://archive.apache.org/dist/spark/spark-2.4.3/spark-2.4.3-bin-hadoop2.7.tgz
sudo tar -xzf spark-2.4.3-bin-hadoop2.7.tgz
sudo rm current
sudo ln -s spark-2.4.3-bin-hadoop2.7 current
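
A quick sanity check that the downgrade is consistent end to end (a sketch; the point is that the pyspark package and the Spark build under SPARK_HOME must report the same version):

    # Sanity-check sketch: pyspark and the JVM-side Spark build must agree.
    import os
    import pyspark
    from pyspark.sql import SparkSession

    print(pyspark.__version__)            # expect 2.4.3
    print(os.environ.get("SPARK_HOME"))   # should point at the "current" symlink
    spark = SparkSession.builder.getOrCreate()
    print(spark.version)                  # expect 2.4.3 as well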

@miguelgfierro mentioned this pull request on Jun 23, 2020
@yueguoguo (Collaborator) commented:

@miguelgfierro I will commit the "research recommendation" scenarios directly into the miguel/burn_and_destroy branch. Does that block anything?

@miguelgfierro (Collaborator, Author) replied:

> @miguelgfierro I will commit the "research recommendation" scenarios directly into the miguel/burn_and_destroy branch. Does that block anything?

perfect

@miguelgfierro merged commit c82807d into staging on Jun 29, 2020
@miguelgfierro deleted the miguel/burn_and_destroy branch on June 29, 2020 12:15