Fixing memory leaks in read_csv #23072

Merged (1 commit) on Nov 19, 2018

Conversation

zhezherun
Contributor

@zhezherun zhezherun commented Oct 10, 2018

This PR fixes a memory leak in parsers.pyx detected by valgrind, and adds further cleanup that should avoid memory leaks when exceptions are raised.

closes #21353

  • Moved the allocation of na_hashset further down, closer to where it is used. Otherwise it is not freed if continue is executed.
  • Delete na_hashset if there is an exception.
  • Also clean up the allocation inside kset_from_list before raising an exception.
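
The pattern behind the first two bullets can be sketched in plain Python. This is only an illustration, not the Cython code from parsers.pyx: `TrackedAllocator`, `convert_leaky`, and `convert_fixed` are hypothetical names, and the counter stands in for manual malloc/free of na_hashset.

```python
class TrackedAllocator:
    """Hypothetical stand-in for manual C allocation: counts live handles."""

    def __init__(self):
        self.live = 0

    def alloc(self):
        self.live += 1
        return object()

    def free(self, handle):
        self.live -= 1


def convert_leaky(columns, skipped, allocator):
    """Allocates before the skip check, so `continue` leaks the handle."""
    for name in columns:
        na_set = allocator.alloc()
        if name in skipped:
            continue  # na_set is never freed on this path
        allocator.free(na_set)


def convert_fixed(columns, skipped, allocator, fail_on=None):
    """Allocates only when needed and frees even if conversion raises."""
    for name in columns:
        if name in skipped:
            continue  # nothing allocated yet, nothing to leak
        na_set = allocator.alloc()
        try:
            if name == fail_on:
                # Simulated conversion failure (assumption for illustration).
                raise ValueError(f"cannot convert column {name!r}")
        finally:
            allocator.free(na_set)
```

With one skipped column, `convert_leaky` leaves one live handle per call, while `convert_fixed` leaves none, including when the simulated conversion raises.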

@codecov

codecov bot commented Oct 10, 2018

Codecov Report

Merging #23072 into master will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master   #23072   +/-   ##
=======================================
  Coverage   92.28%   92.28%           
=======================================
  Files         161      161           
  Lines       51434    51434           
=======================================
  Hits        47467    47467           
  Misses       3967     3967
Flag Coverage Δ
#multiple 90.68% <ø> (ø) ⬆️
#single 42.29% <ø> (ø) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 0ab8eb2...36c1104.

@jreback
Contributor

jreback commented Oct 10, 2018

can you run the asv's for csv to see if any effects & a whatsnew note

@jreback jreback added the Performance (memory or execution speed performance) and IO CSV (read_csv, to_csv) labels Oct 10, 2018
@jreback
Contributor

jreback commented Oct 10, 2018

can you run the code at the very top of the issue here and show the leak has disappeared.

@zhezherun
Contributor Author

zhezherun commented Oct 10, 2018

@kuraga, would you be able to test this change on your reproducer from #21353?

@jreback, not sure what asv's for csv are.

This patch fixes a memory leak in some proprietary code I was working on so I can't post it here, and I don't have a standalone reproducer for my issue, sorry. The reason why I mentioned #21353 was that the leak I was seeing was also coming from read_csv.

@jreback
Contributor

jreback commented Oct 10, 2018

@zhezherun i understand, look at the top of the issue. there is a script at the top. pls run that with the new version.
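
For readers without access to the script in #21353, a generic leak check might look like the sketch below. It is not that script: `peak_rss_after_each_round` is a hypothetical helper, it relies on the Unix-only `resource` module, and `ru_maxrss` is reported in kilobytes on Linux but bytes on macOS.

```python
import gc
import resource


def peak_rss_after_each_round(func, rounds=5, calls_per_round=10):
    """Call func repeatedly and record the process's peak RSS after each round.

    Peak RSS never decreases, so a leak shows up as readings that keep
    climbing round after round instead of levelling off.
    """
    readings = []
    for _ in range(rounds):
        for _ in range(calls_per_round):
            func()
        gc.collect()  # rule out garbage that Python itself can reclaim
        readings.append(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)
    return readings
```

Passing something like `lambda: pd.read_csv(path)` as `func`, once on master and once on this branch, would show whether the readings level off after the fix.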

@TomAugspurger
Contributor

@zhezherun do you know, will this patch address #19941 as well? Would this memory leak have been exacerbated by multiple threads, or do you think that's a different issue?

@TomAugspurger
Contributor

I pulled down this branch and confirmed that it does not fix #19941 (but @zhezherun if you have any guesses on what may be going on there it'd be appreciated).

@jreback jreback added this to the 0.24.0 milestone Oct 11, 2018
@jreback
Contributor

jreback commented Oct 11, 2018

@zhezherun can you add a whatsnew note in bug fixes / io section, mentioning the issue number. ping on green.

@jreback
Contributor

jreback commented Oct 18, 2018

@zhezherun can you update

@TomAugspurger
Contributor

Added a release note. Ping on green.

Contributor

@jreback jreback left a comment

small comments, ping on green.

doc/source/whatsnew/v0.24.0.txt (outdated, resolved)
pandas/_libs/parsers.pyx (outdated, resolved)
pandas/_libs/parsers.pyx (resolved)
@jreback
Contributor

jreback commented Nov 18, 2018

@gfyoung can you rebase and fix up?

* Move allocation of na_hashset down to avoid a leak on continue
* Delete na_hashset if there is an exception
* Clean up table before raising an exception

Closes gh-21353.
@jreback
Contributor

jreback commented Nov 19, 2018

lgtm. ping on green.

@gfyoung
Member

gfyoung commented Nov 19, 2018

@jreback : Comments addressed, and all is green. PTAL.

@TomAugspurger TomAugspurger merged commit 3d6d873 into pandas-dev:master Nov 19, 2018
@TomAugspurger
Contributor

Thanks all.

thoo added a commit to thoo/pandas that referenced this pull request Nov 19, 2018
…fixed

* upstream/master: (46 commits)
  DEPS: bump xlrd min version to 1.0.0 (pandas-dev#23774)
  BUG: Don't warn if default conflicts with dialect (pandas-dev#23775)
  BUG: Fixing memory leaks in read_csv (pandas-dev#23072)
  TST: Extend datetime64 arith tests to array classes, fix several broken cases (pandas-dev#23771)
  STYLE: Specify bare exceptions in pandas/tests (pandas-dev#23370)
  ENH: between_time, at_time accept axis parameter (pandas-dev#21799)
  PERF: Use is_utc check to improve performance of dateutil UTC in DatetimeIndex methods (pandas-dev#23772)
  CLN: io/formats/html.py: refactor (pandas-dev#22726)
  API: Make Categorical.searchsorted returns a scalar when supplied a scalar (pandas-dev#23466)
  TST: Add test case for GH14080 for overflow exception (pandas-dev#23762)
  BUG: Don't extract header names if none specified (pandas-dev#23703)
  BUG: Index.str.partition not nan-safe (pandas-dev#23558) (pandas-dev#23618)
  DEPR: tz_convert in the Timestamp constructor (pandas-dev#23621)
  PERF: Datetime/Timestamp.normalize for timezone naive datetimes (pandas-dev#23634)
  TST: Use new arithmetic fixtures, parametrize many more tests (pandas-dev#23757)
  REF/TST: Add more pytest idiom to parsers tests (pandas-dev#23761)
  DOC: Add ignore-deprecate argument to validate_docstrings.py (pandas-dev#23650)
  ENH: update pandas-gbq to 0.8.0, adds credentials arg (pandas-dev#23662)
  DOC: Improve error message to show correct order (pandas-dev#23652)
  ENH: Improve error message for empty object array (pandas-dev#23718)
  ...
Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019
* Move allocation of na_hashset down to avoid a leak on continue
* Delete na_hashset if there is an exception
* Clean up table before raising an exception

Closes gh-21353.
@wasonkartik

Hi, I am facing this issue on google compute engine (Windows Server 2012 R2 Datacenter, 64 bit). How do I fix it? I have installed the latest version of Pandas.

Labels: IO CSV (read_csv, to_csv), Performance (memory or execution speed performance)

Successfully merging this pull request may close these issues:

Memory leak in pd.read_csv or DataFrame

5 participants