Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

EHN: Improved thread safety for read_html() GH16928 #16930

Merged
merged 1 commit into from
Jul 21, 2017
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 2 additions & 0 deletions doc/source/whatsnew/v0.21.0.txt
Original file line number Diff line number Diff line change
Expand Up @@ -164,6 +164,8 @@ I/O

- Bug in :func:`read_stata` where value labels could not be read when using an iterator (:issue:`16923`)

- Bug in :func:`read_html` where import check fails when run in multiple threads (:issue:`16928`)

Plotting
^^^^^^^^
- Bug in plotting methods using ``secondary_y`` and ``fontsize`` not setting secondary axis font size (:issue:`12565`)
Expand Down
4 changes: 2 additions & 2 deletions pandas/io/html.py
Original file line number Diff line number Diff line change
Expand Up @@ -37,8 +37,6 @@ def _importers():
if _IMPORTS:
return

_IMPORTS = True

global _HAS_BS4, _HAS_LXML, _HAS_HTML5LIB

try:
Expand All @@ -59,6 +57,8 @@ def _importers():
except ImportError:
pass

_IMPORTS = True


#############
# READ HTML #
Expand Down
36 changes: 35 additions & 1 deletion pandas/tests/io/test_html.py
Original file line number Diff line number Diff line change
Expand Up @@ -3,13 +3,17 @@
import glob
import os
import re
import threading
import warnings


# imports needed for Python 3.x but will fail under Python 2.x
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not needed here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you elaborate?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@3553x : He means to just remove the line. I thought that line would be useful for clarity?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if this is not supported in py2, then you should either conditionally import it, or set reload=None and check it (and skip in a test).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reload is a built-in in Python2. There is no need to import it unless you are using Python3.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

try:
from importlib import import_module
from importlib import import_module, reload
except ImportError:
import_module = __import__


from distutils.version import LooseVersion

import pytest
Expand All @@ -22,6 +26,7 @@
from pandas.compat import (map, zip, StringIO, string_types, BytesIO,
is_platform_windows, PY3)
from pandas.io.common import URLError, urlopen, file_path_to_url
import pandas.io.html
from pandas.io.html import read_html
from pandas._libs.parsers import ParserError

Expand Down Expand Up @@ -931,3 +936,32 @@ def test_same_ordering():
dfs_lxml = read_html(filename, index_col=0, flavor=['lxml'])
dfs_bs4 = read_html(filename, index_col=0, flavor=['bs4'])
assert_framelist_equal(dfs_lxml, dfs_bs4)


class ErrorThread(threading.Thread):
def run(self):
try:
super(ErrorThread, self).run()
except Exception as e:
self.err = e
else:
self.err = None


@pytest.mark.slow
def test_importcheck_thread_safety():
Copy link
Member

@gfyoung gfyoung Jul 18, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you reference issue number below the test function definition? Something like "see gh-16928" will suffice.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you can only test under py3 then use a @pytest.mark.skipf(not compat.PY3, reason=.....) decorator

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've just tested it with python3 `which pytest` test_html.py::test_importcheck_thread_safety and python2 `which pytest` test_html.py::test_importcheck_thread_safety for both cases (with and without fix). The test seems to work for both versions on my system.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so if this works, then remove the comment above

# see gh-16928

# force import check by reinitalising global vars in html.py
reload(pandas.io.html)

filename = os.path.join(DATA_PATH, 'valid_markup.html')
helper_thread1 = ErrorThread(target=read_html, args=(filename,))
helper_thread2 = ErrorThread(target=read_html, args=(filename,))

helper_thread1.start()
helper_thread2.start()

while helper_thread1.is_alive() or helper_thread2.is_alive():
pass
assert None is helper_thread1.err is helper_thread2.err