
Gramhagen/wikidata #902

Merged: 3 commits merged into staging from gramhagen/wikidata on Sep 6, 2019
Conversation

@gramhagen (Collaborator)

Description

Cleanup of the wikidata notebook and utils. This speeds up notebook execution a bit (the first data pull went from 8 s to 5 s), and the longer MovieLens data pull should be faster too (I also clipped it to 50 results by default).

@almudenasanz it would be great to get your feedback here. I hid some of the functionality to make the code easier to reuse, but if you think it's important to surface the functions that get the entities, links, and descriptions, we can add that back into the notebook.

Related Issues

#880 — it's possible that this might help (mainly due to session caching?). I did limit some of the results in the normal case, but that shouldn't affect the integration test.
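
For reference, a minimal sketch of the session-reuse idea that likely accounts for the speedup (not the repo's exact utils; the helper name is illustrative):

```python
import requests

SPARQL_URL = "https://query.wikidata.org/sparql"

# One Session shared by every query: the underlying TCP/TLS connection is
# kept alive and reused instead of being renegotiated per request.
session = requests.Session()

def query_wikidata(sparql):
    """Run a SPARQL query against Wikidata and return the JSON bindings."""
    response = session.get(SPARQL_URL, params={"query": sparql, "format": "json"})
    response.raise_for_status()
    return response.json()["results"]["bindings"]
```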

Checklist:

  • I have followed the contribution guidelines and code style for this project.
  • I have added tests covering my contributions.
  • I have updated the documentation accordingly.

@review-notebook-app

Check out this pull request on ReviewNB: https://app.reviewnb.com/microsoft/recommenders/pull/902

You'll be able to see notebook diffs and discuss changes. Powered by ReviewNB.

@gramhagen gramhagen changed the base branch from master to staging August 22, 2019 03:59
@almudenasanz (Collaborator)


Thanks a lot @gramhagen! Very nice to see how you handled the sessions.

Maybe we could surface the steps for getting the Wikidata ID and the links as separate examples, since they query different APIs and some people may want to use only one of them. E.g., get the Wikidata ID from a text query, or, from a Wikidata ID, get the related entities or their descriptions.

@gramhagen (Collaborator, Author)

Makes sense. We can show the steps in the first example and then use the helper function later. I'll update the notebook.
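
For concreteness, a hedged sketch of what the two surfaced steps could look like (hypothetical helper names; the actual notebook utils may differ):

```python
import requests

API_URL = "https://www.wikidata.org/w/api.php"
session = requests.Session()

def find_wikidata_id(name):
    """Step 1: free-text query -> Wikidata entity ID (e.g. 'Q47703')."""
    params = {
        "action": "wbsearchentities",
        "search": name,
        "language": "en",
        "format": "json",
        "limit": 1,
    }
    results = session.get(API_URL, params=params).json()["search"]
    return results[0]["id"] if results else None

def get_description(entity_id):
    """Step 2: Wikidata ID -> English description of the entity."""
    params = {
        "action": "wbgetentities",
        "ids": entity_id,
        "props": "descriptions",
        "languages": "en",
        "format": "json",
    }
    entity = session.get(API_URL, params=params).json()["entities"][entity_id]
    return entity["descriptions"].get("en", {}).get("value")

print(get_description(find_wikidata_id("The Godfather")))
```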

@miguelgfierro (Collaborator) left a review:

LGTM

@miguelgfierro (Collaborator)

weird error in pyspark:


tests/unit/test_notebooks_pyspark.py ....F.                              [100%]

=================================== FAILURES ===================================
______________________________ test_spark_tuning _______________________________

notebooks = {'als_deep_dive': '/data/home/recocat/cicd/4/s/notebooks/02_model/als_deep_dive.ipynb', 'als_pyspark': '/data/home/rec...baseline_deep_dive.ipynb', 'data_split': '/data/home/recocat/cicd/4/s/notebooks/01_prepare_data/data_split.ipynb', ...}

    @pytest.mark.notebooks
    @pytest.mark.spark
    def test_spark_tuning(notebooks):
        notebook_path = notebooks["spark_tuning"]
        pm.execute_notebook(
            notebook_path,
            OUTPUT_NOTEBOOK,
            kernel_name=KERNEL_NAME,
            parameters=dict(
                NUMBER_CORES="*",
                NUMBER_ITERATIONS=3,
                SUBSET_RATIO=0.5,
                RANK=[5, 5],
>               REG=[0.1, 0.01]
            )
        )

tests/unit/test_notebooks_pyspark.py:51: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
/anaconda/envs/reco_pyspark/lib/python3.6/site-packages/papermill/execute.py:94: in execute_notebook
    raise_for_execution_errors(nb, output_path)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

nb = {'cells': [{'cell_type': 'code', 'metadata': {'inputHidden': True, 'hide_input': True}, 'execution_count': None, 'sour...nd_time': '2019-09-05T13:15:25.886974', 'duration': 21.609476, 'exception': True}}, 'nbformat': 4, 'nbformat_minor': 2}
output_path = 'output.ipynb'

    def raise_for_execution_errors(nb, output_path):
        """Assigned parameters into the appropriate place in the input notebook
    
        Parameters
        ----------
        nb : NotebookNode
           Executable notebook object
        output_path : str
           Path to write executed notebook
        """
        error = None
        for cell in nb.cells:
            if cell.get("outputs") is None:
                continue
    
            for output in cell.outputs:
                if output.output_type == "error":
                    error = PapermillExecutionError(
                        exec_count=cell.execution_count,
                        source=cell.source,
                        ename=output.ename,
                        evalue=output.evalue,
                        traceback=output.traceback,
                    )
                    break
    
        if error:
            # Write notebook back out with the Error Message at the top of the Notebook.
            error_msg = ERROR_MESSAGE_TEMPLATE % str(error.exec_count)
            error_msg_cell = nbformat.v4.new_code_cell(
                source="%%html\n" + error_msg,
                outputs=[
                    nbformat.v4.new_output(output_type="display_data", data={"text/html": error_msg})
                ],
                metadata={"inputHidden": True, "hide_input": True},
            )
            nb.cells = [error_msg_cell] + nb.cells
            write_ipynb(nb, output_path)
>           raise error
E           papermill.exceptions.PapermillExecutionError: 
E           ---------------------------------------------------------------------------
E           Exception encountered at "In [11]":
E           ---------------------------------------------------------------------------
E           Py4JJavaError                             Traceback (most recent call last)
E           /anaconda/envs/reco_pyspark/lib/python3.6/site-packages/pyspark/sql/utils.py in deco(*a, **kw)
E                62         try:
E           ---> 63             return f(*a, **kw)
E                64         except py4j.protocol.Py4JJavaError as e:
E           
E           /anaconda/envs/reco_pyspark/lib/python3.6/site-packages/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
E               327                     "An error occurred while calling {0}{1}{2}.\n".
E           --> 328                     format(target_id, ".", name), value)
E               329             else:
E           
E           <class 'str'>: (<class 'py4j.protocol.Py4JNetworkError'>, Py4JNetworkError('An error occurred while trying to connect to the Java server (127.0.0.1:35421)',))
E           

Maybe a problem with the Spark instantiation?
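
One way to test the instantiation hypothesis (a sketch, not the repo's test code): a Py4JNetworkError typically means the JVM behind the Spark session died or never started, so building a bare session with the same core setting can isolate the problem.

```python
from pyspark.sql import SparkSession

# Build a bare local session with the same core setting the test passes
# (NUMBER_CORES="*"); if this also fails, the problem is the environment,
# not the tuning notebook.
spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("instantiation-check")
    .config("spark.driver.memory", "4g")  # assumed value; an OOM-killed driver surfaces as a Py4J connection error
    .getOrCreate()
)
print(spark.range(10).count())  # trivial job to confirm the JVM gateway is alive
spark.stop()
```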

@miguelgfierro miguelgfierro merged commit 8556a7f into staging Sep 6, 2019
@miguelgfierro miguelgfierro deleted the gramhagen/wikidata branch September 6, 2019 11:33
yueguoguo pushed a commit that referenced this pull request Sep 9, 2019