use conda.api instead of parallel calls to the conda binary #4775

Merged: 6 commits, Jan 11, 2021


keewis (Collaborator) commented Jan 7, 2021

Currently, our min_deps_check.py script uses the conda binary to analyze the package metadata. This is not very efficient because conda downloads and parses repodata.json on each call. To speed up the script, we run 8 of these calls (plus the processing of their results) in parallel using a ThreadPoolExecutor, but that increases memory consumption (and my old laptop does not have that much memory).
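For readers unfamiliar with the pattern being replaced, here is a minimal sketch of fanning per-package work out to 8 threads. It is not the script's actual code: `analyze_all` and `fake_analyze` are hypothetical names, and `fake_analyze` merely stands in for a call to the conda binary.

```python
from concurrent.futures import ThreadPoolExecutor


def analyze_all(packages, analyze, max_workers=8):
    """Run `analyze` once per package in a thread pool, keeping input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the order of `packages`,
        # regardless of which worker finishes first
        return list(pool.map(analyze, packages))


def fake_analyze(pkg):
    # Hypothetical stand-in for one invocation of the conda binary
    return pkg, len(pkg)


results = analyze_all(["numpy", "pandas", "dask"], fake_analyze)
```

Each worker holds its own intermediate data while it runs, which is consistent with the memory growth described above.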

conda provides the conda.api.SubdirData.query_all function which will cache the parsed repodata.json files between calls. Using that, my laptop can complete

python ci/min_deps_check.py ci/requirements/py36-bare-minimum.yml
python ci/min_deps_check.py ci/requirements/py36-min-all-deps.yml

in about 30 seconds, even though the packages are analyzed sequentially.
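A rough sketch of what a sequential query through conda.api.SubdirData.query_all could look like is given below. It is an illustration under assumptions, not the code from this PR: the helper names and the channel list are made up, and the records are assumed to expose a `version` string and a `timestamp` in seconds (0 when unknown). conda is imported lazily so the date-reduction helper can be tried without conda installed.

```python
import itertools
from datetime import datetime, timezone
from types import SimpleNamespace


def earliest_release_dates(records):
    """Map (major, minor) to the earliest publication date among `records`."""

    def key_and_time(rec):
        # Assumes versions start with "<int>.<int>"; other schemes would need handling
        major, minor = (int(part) for part in rec.version.split(".")[:2])
        return (major, minor), datetime.fromtimestamp(rec.timestamp, tz=timezone.utc)

    # Skip records whose timestamp is 0, since they carry no usable date
    data = sorted(key_and_time(rec) for rec in records if rec.timestamp)
    return {
        version: min(time for _, time in group)
        for version, group in itertools.groupby(data, key=lambda item: item[0])
    }


def query_conda(pkg, channels=("conda-forge", "defaults")):
    """Query the given channels for `pkg` (channel names are an assumption)."""
    import conda.api  # lazy import: requires conda to be installed as a library

    records = conda.api.SubdirData.query_all(pkg, channels=list(channels))
    return earliest_release_dates(records)


# Usage with stand-in records instead of a live conda query
records = [
    SimpleNamespace(version="1.17.0", timestamp=1564000000),
    SimpleNamespace(version="1.17.3", timestamp=1571000000),
    SimpleNamespace(version="1.16.5", timestamp=1566000000),
    SimpleNamespace(version="1.18.0", timestamp=0),
]
result = earliest_release_dates(records)
```

Because the parsed repodata is cached between calls, calling `query_conda` once per package in a plain loop avoids re-reading the same files each time.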

The documentation states that this function may be changed without warning between minor versions, so we would have to pin conda to an x.y version to be able to use it.

Edit: the documentation does have that warning, but it also says:

There are 3 supported public modules. We support:

  • import conda.cli.python_api
  • import conda.api
  • import conda.exports

The first 2 should have very long-term stability. The third is guaranteed to be stable throughout the lifetime of a feature release series--i.e. minor version number.

so I guess we don't have to pin?

cc @crusaderky

By default, the upstream dev CI is disabled on pull request and push events. You can override this behavior per commit by adding a [test-upstream] tag to the first line of the commit message.

keewis marked this pull request as ready for review January 7, 2021 14:05
keewis changed the title from "WIP: use conda.api instead of parallel calls to the conda binary" to "use conda.api instead of parallel calls to the conda binary" Jan 7, 2021
keewis (Collaborator, Author) commented Jan 8, 2021

The new code should now generate the same report as the old code, so this should be ready for review.

The tool says we can bump a few libraries (for example we could bump numpy to 1.17 in a few days), but I guess we should resolve #4179 first.

mathause (Collaborator) commented Jan 8, 2021

LGTM

@keewis keewis requested a review from crusaderky January 8, 2021 14:19
shoyer (Member) left a comment

This is great, thanks!

crusaderky (Contributor) left a comment

This looks great

@crusaderky crusaderky merged commit db6f4be into pydata:master Jan 11, 2021
@keewis keewis deleted the refactor-min_deps_check branch January 11, 2021 11:46
dcherian added a commit to TomNicholas/xarray that referenced this pull request Jan 18, 2021