read_html() Thread Safety #16928

Closed
3553x opened this issue Jul 14, 2017 · 0 comments · Fixed by #16930
Labels
IO HTML read_html, to_html, Styler.apply, Styler.applymap Multithreading Parallelism in pandas
3553x commented Jul 14, 2017

Code Sample

#!/usr/bin/python3
import pandas
import threading

def fetch_file():
    url = "https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html"
    pandas.read_html(url)

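# Start two threads that call read_html() concurrently.
thread1 = threading.Thread(target=fetch_file)
thread2 = threading.Thread(target=fetch_file)

thread1.start()
thread2.start()

# Wait for both threads to finish.
thread1.join()
thread2.join()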

Output

Exception in thread Thread-2:
Traceback (most recent call last):
  File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "./pandas_bug.py", line 7, in fetch_file
    pandas.read_html(url)
  File "/usr/lib/python3.6/site-packages/pandas/io/html.py", line 904, in read_html
    keep_default_na=keep_default_na)
  File "/usr/lib/python3.6/site-packages/pandas/io/html.py", line 731, in _parse
    parser = _parser_dispatch(flav)
  File "/usr/lib/python3.6/site-packages/pandas/io/html.py", line 691, in _parser_dispatch
    raise ImportError("lxml not found, please install it")
ImportError: lxml not found, please install it

Problem description

read_html() doesn't appear to be thread-safe. This specific failure seems to be caused by html.py setting _IMPORTS to True too early: the second thread sees the flag already set, enters _parser_dispatch and raises an exception while the first thread has not yet finished the import check.

I have written a potential fix and will open a PR shortly.
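
To illustrate the suspected race, here is a minimal sketch that does not use pandas at all. The class `LazyFlags`, its attributes and the `run` helper are hypothetical names made up for this example; `checked` plays the role of `_IMPORTS` and `has_parser` stands in for the result of the lxml check. The sleep only simulates a slow first import. It is a toy model under those assumptions, not the code in html.py and not necessarily what the PR does.

```python
import threading
import time


class LazyFlags:
    """Toy model of a lazily initialised capability check (hypothetical)."""

    def __init__(self, set_flag_early):
        self.set_flag_early = set_flag_early
        self.checked = False     # plays the role of _IMPORTS
        self.has_parser = False  # stands in for "lxml was found"

    def _probe(self):
        time.sleep(0.3)          # simulate a slow first import
        self.has_parser = True

    def ensure_checked(self):
        if self.checked:
            return
        if self.set_flag_early:
            self.checked = True  # flag raised before the probe has finished
            self._probe()
        else:
            self._probe()
            self.checked = True  # flag raised only once the result is known

    def dispatch(self):
        self.ensure_checked()
        if not self.has_parser:
            raise ImportError("parser not found")
        return "parser"


def run(set_flag_early):
    """Call dispatch() from two threads and collect what each one saw."""
    flags = LazyFlags(set_flag_early)
    outcomes = []

    def worker():
        try:
            outcomes.append(flags.dispatch())
        except ImportError:
            outcomes.append("ImportError")

    first = threading.Thread(target=worker)
    second = threading.Thread(target=worker)
    first.start()
    time.sleep(0.05)  # let the first thread get into the slow probe
    second.start()
    first.join()
    second.join()
    return sorted(outcomes)


if __name__ == "__main__":
    print(run(set_flag_early=True))
    print(run(set_flag_early=False))
```

With the flag set early, the second thread skips the check and fails even though the parser becomes available a moment later. With the flag set after the probe, both threads may run the probe, which is redundant but harmless in this model; guarding the check with a `threading.Lock` would be another option.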

Expected Output

No exception should be thrown since lxml is installed and the program works fine without multi-threading.

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.1.final.0
python-bits: 64
OS: Linux
OS-release: 4.11.3-1-ARCH
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8
LOCALE: en_GB.UTF-8

pandas: 0.20.1
pytest: None
pip: 9.0.1
setuptools: 36.0.1
Cython: None
numpy: 1.12.1
scipy: 0.19.0
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.0.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.6.0
html5lib: 0.999999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
pandas_gbq: None
pandas_datareader: None

@jreback jreback added the IO HTML read_html, to_html, Styler.apply, Styler.applymap label Jul 14, 2017
@gfyoung gfyoung added the Multithreading Parallelism in pandas label Jul 14, 2017
3553x added a commit to 3553x/pandas that referenced this issue Jul 16, 2017
@jreback jreback added this to the 0.21.0 milestone Jul 19, 2017