
Parallel downloads #825

Open
cool-RR opened this issue Mar 4, 2013 · 51 comments
Labels
resolution: deferred till PR (Further discussion will happen when a PR is made), type: enhancement (Improvements to functionality)

Comments

@cool-RR

cool-RR commented Mar 4, 2013

How about having pip download all the packages in parallel instead of waiting for each one to finish before downloading the next?

After that is implemented, how about having pip start installing one package while it's downloading the next ones?


@dstufft
Member

dstufft commented Jan 20, 2014

Yes we do care. However it's largely a problem of there being bigger issues to tackle first. Sorry that nobody responded to your ticket though!

@cool-RR
Author

cool-RR commented Jan 20, 2014

If we're looking at parallelizing just the download part, is it much more
complex than sticking a concurrent.futures.ThreadPoolExecutor on it and
adding a configuration option to turn it off?
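A minimal sketch of that approach, assuming a hypothetical download(url, dest_dir) helper rather than pip's actual internals:

from concurrent.futures import ThreadPoolExecutor

def download_all(urls, dest_dir, parallel=True):
    # Configuration option: fall back to serial downloads when turned off.
    if not parallel:
        return [download(url, dest_dir) for url in urls]
    # Otherwise fan the downloads out over a small thread pool.
    with ThreadPoolExecutor(max_workers=8) as pool:
        return list(pool.map(lambda url: download(url, dest_dir), urls))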


@thedrow

thedrow commented Jan 21, 2014

Parallelizing disk writes can actually make downloading slower on HDDs, since the drive head has to seek back and forth between files. You'd have to buffer the entire downloaded content in memory and write it out from there to actually gain anything. It's not that easy. As far as I know this is not the case with SSDs.

Also, depending on your network's throughput, downloading in parallel might actually slow down the installation process.

I'm not sure this is a good idea.

Downloading and installing in parallel might be a good idea, though, as long as you download into memory so there's only one concurrent disk write at a time. Again, see the note about SSDs.

@jquast

jquast commented Jun 23, 2014

I disagree. Most downloads are far slower than a disk can write, even when writing many files. As an optional --feature, I think it is a trivial choice -- people who know they are downloading over a LAN onto a slow disk would choose not to --enable such a feature. People like myself, downloading several-hundred-megabyte packages over broadband, would gain a lot by choosing to use such a --feature.

@sholsapp

+1 to parallel download (and install if possible).

@mattvonrocketstein

+1, I frequently work on projects that have hundreds of dependencies and are built from scratch repeatedly throughout the average workday. This stuff crawls even with a PyPI mirror on the LAN, so having this would be great! http://stackoverflow.com/questions/11021130/parallel-pip-install The options presented there are not great, but it does show there is (and has been) interest in a solution to this issue.

xavfernandez added the "type: enhancement" (Improvements to functionality) label Oct 7, 2015
@jamesob

jamesob commented Oct 20, 2015

+1, this would be huge. Willing to take a crack at implementing.

@mattvonrocketstein

If anyone is curious why this should be a core feature and not something an external tool provides, I can summarize the problem. Using something like "xargs --max-args=1 --max-procs=4 sudo pip install < requires.txt" can actually increase the total downloads quite a bit. This happens when, for example, "django-foo" and "django-bar" both require Django but specify different constraints on which version is acceptable. I think pip itself has to compute the unified requirement set (probably including requirements-of-requirements) to avoid redundant downloads; see the toy illustration below.
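As a toy illustration of that unification (a sketch using the standalone packaging library, not pip's actual resolution logic):

from packaging.specifiers import SpecifierSet

# Hypothetical constraints from two dependents on the same package.
django_foo_wants = SpecifierSet(">=1.5")
django_bar_wants = SpecifierSet("<1.7")

# Intersect them so the common dependency is fetched only once.
unified = django_foo_wants & django_bar_wants  # equivalent to ">=1.5,<1.7"
print(unified.contains("1.6"))  # True: one download satisfies both dependents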

@sholsapp

Does anyone knowledgeable know whether parallel install is an option? When I tried implementing a build system that executed pip in parallel, I hit problems specifically related to installing modules that had native extensions, but I didn't dig further into it. If people know what might be going wrong there, I'd love to hear it.

@fillest

fillest commented Apr 27, 2016

+1 to parallelization, not just for downloading but also for things like "Collecting ... File was already downloaded" with pip wheel. It is painfully slow for projects with a huge dependency list and multiple hosts; it sometimes becomes one of the deployment bottlenecks. I had to add a cache layer to our project - run it only if requirements.txt has changed.

@RonnyPfannschmidt
Contributor

Note that enabling complete "parallelism" amounts to a relatively complete refactoring of many internals; this needs an idea for a starting point and quite a few refactoring steps over time.

There are also UX issues: while downloading/unpacking is simple, wheel generation and setup.py invocation are full of potential error conditions (however, they are needed until everybody uses wheels whenever possible, and they are needed right in the middle of execution).

@mattvonrocketstein

So if the UX complication is sidestepped by doing only parallel downloads/unpacking, doesn't it make sense for that to be the first goal? Full parallelism does seem like a huge undertaking. I'm no expert, but as far as that implementation goes, here are two ideas that came to mind (see the sketch below):

a) In the set of all the requirements and the requirements of requirements, discover the subsets whose members must be installed sequentially. All such subsets can then be installed in parallel, at least. As far as UX goes, errors here can be shown whenever they occur, because any error encountered is truly an error.

b) In the set of all the requirements and the requirements of requirements, a parallel worker pops something out and tries installing it. If the requirement installs successfully, great. If it fails, we assume another requirement must be installed first and put it back in. As far as UX goes, errors here must be suppressed unless they persist until the end of the procedure, because any error encountered might be something we can recover from later in the process.
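A minimal sketch of idea (b), written as repeated parallel passes rather than a live work queue; try_install is a hypothetical helper, not pip's actual API:

from concurrent.futures import ThreadPoolExecutor

def install_with_retries(reqs, workers=4):
    pending = list(reqs)
    while pending:
        # Try everything that's left concurrently in this pass.
        with ThreadPoolExecutor(max_workers=workers) as pool:
            results = list(pool.map(lambda r: (r, try_install(r)), pending))
        failed = [req for req, ok in results if not ok]
        if len(failed) == len(pending):
            # No requirement succeeded this pass: these are real errors.
            raise RuntimeError("could not install: %s" % failed)
        pending = failed  # suppressed failures may succeed on a later pass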

@RonnyPfannschmidt
Contributor

The UX issue cannot be side-stepped if any of the downloaded packages is an sdist.

The requirement graph is nonlinear and changes whenever more about the dependencies and constraints of a freshly downloaded package becomes known.

As soon as an sdist is downloaded, its build/egg-info/wheel process has to be triggered.

@thedrow

thedrow commented Sep 20, 2016

I just took a stab at it. It seems that the progress bar/spinner is an issue since it registers a signal handler in order to clean up downloads.
I'm not sure how to resolve it yet.

@pradyunsg
Member

pradyunsg commented May 16, 2017

Here's what I think...

To be able to have parallel downloads, pip would need to determine which packages are to be installed and where they need to be fetched from. As of today, it is not possible since pip cannot determine the dependencies of a package without downloading it. While there's a bunch of interesting p

That said, this is definitely something that'll be pretty awesome to have. :)

@fruechel

I understand that you won't know follow-on dependencies before downloading a package. But looking at a requirements.txt, I would assume you could start a parallel download of all of those packages and then expand the list of things to download as you discover more dependencies (see the sketch below). The only edge case I can think of right now would be different version requirements on the same dependency, but I would assume that problem exists with regular downloading as well.
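A minimal sketch of that expand-as-you-go approach, assuming hypothetical download(req) and deps_of(archive_path) helpers rather than pip's API; it naively de-duplicates by requirement and ignores the conflicting-specifier edge case discussed next:

from concurrent.futures import ThreadPoolExecutor, as_completed

def download_closure(initial_reqs, workers=8):
    # Start with requirements.txt; grow the set as the dependencies of
    # finished downloads become known.
    seen = set(initial_reqs)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(download, req): req for req in initial_reqs}
        while futures:
            done = next(as_completed(futures))  # wait for any one download
            archive_path = done.result()
            del futures[done]
            for dep in deps_of(archive_path):   # known only after download
                if dep not in seen:
                    seen.add(dep)
                    futures[pool.submit(download, dep)] = dep
    return seen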

@pradyunsg
Member

pradyunsg commented May 17, 2017

I didn't mean to post that message yet. Oops.

Anyway, the point I was making was that with the way pip currently handles dependencies, there's a race condition where 2 packages depend on a common third with compatible but different version specifiers. Whichever package is downloaded first, its specifier would be used, and that's the bad thing: you have behaviour that changes because of how the network behaved.

The only right way to do this, then, is to have dependency metadata on PyPI; only then can we determine the packages beforehand and proceed to parallel download/installation. Or somehow manage this during downloading?

Or I'm missing something about this issue.

@pradyunsg
Member

@fruechel Yes. Except with the serial downloads, the version of the common dependency is deterministic.

@fruechel

Understood. You're right, it would make the process non-deterministic, based on random network behaviour. So yeah, until that issue is resolved, implementing this would introduce problematic behaviour that you wouldn't want to build in.

@mattvonrocketstein

there's a race condition where 2 packages depend on a common third with compatible but different version specifiers. Whichever package is downloaded first, its specifier would be used

This problem would seem to be limited to parallel installation. If we're talking strictly about parallel downloads followed by standard serial installs, you might experience an extra, useless download, but I'm not clear on why it should introduce anything nondeterministic.

I see a few potentially different scenarios for this improvement, and maybe it's useful to avoid conflating them:

  • parallel requirements downloads (single-level),
  • parallel requirements downloads (nested requirements),
  • parallel installation of requirements (as individual downloads are completed)
  • parallel installation of requirements (in some second pass, after all downloads are complete)

Of course having several of these things would be awesome, but any one of them could be an improvement. In this thread people have raised at least 3 separate blockers, from what I can tell:

  • missing requirements metadata
  • prerequisite but large-scale refactors of existing code
  • introducing nondeterminism

I'm less clear on which blockers affect which scenarios.

@AraHaan
Contributor

AraHaan commented May 17, 2017

Hmm, this is where a class-level dict keyed by the names of the downloaded dependencies would help with the issue of extra downloads. The install system could then look at that dict, which is only populated at run time, to install all the packages in it, and each entry could hold a class that stores the information the install mechanism in pip needs. I think this could theoretically be used for parallel installs.

@ghost

ghost commented Jul 31, 2017

What is probably most feasible is to download requirements in parallel until reaching a source distribution, and then download in serial. Something like this Python sketch of the original Go-style pseudocode, where reqs, Wheel, and the download methods are placeholders:

from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor() as pool:
    pending = []
    for req in reqs:
        if isinstance(req, Wheel):
            pending.append(pool.submit(req.download))  # wheels: parallel
        else:
            for future in pending:  # drain in-flight downloads first
                future.result()
            pending.clear()
            req.download()          # sdists: serial

@pradyunsg
Member

pradyunsg commented Aug 20, 2017

I've labelled this issue as "deferred till PR".

This label essentially indicates that further discussion related to this issue should be deferred until someone comes around to make a PR. This does not mean that said PR would be accepted - it has not been determined whether this is a useful change to pip, and that decision has been deferred until the PR is made.

@pradyunsg
Member

That said, I think it's pretty much clear that this would be a welcome improvement. :)

@pradyunsg
Member

pradyunsg commented Jun 5, 2019

Let's stick to the topic of adding parallel downloads to pip here.

If there are questions about the broader picture or how pip does vendoring etc, please file a new issue / start a thread on the Discourse forum etc. I've gone ahead and hidden all comments unrelated to the topic at hand.

@samuelcolvin

The issue I created on packaging-problems, pypa/packaging-problems#261, was related to parallel compiling. For me (and I imagine many other users) compiling takes way, way longer than downloading.

This is particularly true:

  • on platforms where manylinux binaries don't work, e.g. alpine
  • when installing packages in the cloud where one generally has a very fast connection but a relatively slow effective clock speed.

@pradyunsg do you consider this a separate concern? If so, I'll create a new issue.

I guess in an ideal world pip would utilise asyncio for downloads, and run_in_executor with a pool of processes to compile. But we don't live in an ideal world.

In summary, the questions for me from this thread are:

  • where can the maximum speedup be achieved for the maximum number of people? download or compile
  • is parallel compiling closely enough related to parallel download to be implemented together? (and therefore tracked under the same issue)
  • would it be easier/more productive to build a new tool for fast installation that doesn't have the same limitations as pip? So you'd effectively do pip install fast-pip; fast-pip install -r requirements.txt.... This might make progress quicker but would leave those still using pip in the slow lane

@KOLANICH
Contributor

KOLANICH commented Jun 5, 2019

I guess in an ideal world pip would utilise asyncio for download, and run_in_executor with a pool of processes to compile.

IMHO precompiled wheels are a better solution. Compiling C++ stuff is a job for CMake and ninja, not pip.

@samuelcolvin

samuelcolvin commented Jun 5, 2019

That's a completely different issue. I entirely agree that pre-compiled binaries would be great - but they're apparently not at all easy; see pypa/manylinux#37 and pypa/packaging-problems#69.

My suggestion is a workaround that makes use of multiple cores to speed up installing/compiling many packages by ~10x.

Maybe I wasn't clear: I'm not suggesting that a single package be installed using multiple threads or processes, but rather that packages can be compiled in parallel. So if I do (on alpine) pip install uvloop aiohttp pydantic cryptography pycryptodome asyncpg pycares, I can use more than one core to compile those packages.

@pfmoore
Member

pfmoore commented Jun 5, 2019

@samuelcolvin I know it's "just" a workaround, but couldn't you do something like

parallel pip wheel {} ::: uvloop aiohttp pydantic cryptography pycryptodome asyncpg pycares
pip install *.whl

Using GNU Parallel to run multiple commands in parallel - excuse me if I got the syntax wrong, I'm not a Unix user. Basically: build the wheels in parallel, then pip install them. If you have multiple machines that have the same architecture/are binary compatible, you can re-use the .whl files (pre-compiled binaries are only hard when you're distributing to machines you don't know are compatible).

@samuelcolvin

samuelcolvin commented Jun 5, 2019

If it helps to demonstrate the problem, try building the following Dockerfile with docker build .:

FROM python:3.7-alpine3.8

RUN apk add -U gcc g++ musl-dev zlib-dev libuv libffi-dev make openssl-dev

RUN pip install -U pip setuptools
RUN pip install uvloop aiohttp pydantic cryptography pycryptodome asyncpg pycares

And here's a snapshot from me running it:
[screenshot: docker build output showing the pip install step dominating the build time, with low CPU utilisation]

You get a pretty clear idea of how little time is spent downloading compared to compiling, and how little of the CPU's resources are being utilised for a CPU-bound task...

For me this took 137 seconds, of which about 130 seconds was the last line.

@samuelcolvin

Using GNU Parallel to run multiple commands in parallel - excuse me if I got the syntax wrong, I'm not a Unix user.

Interesting idea; that might help a lot. The main problem is that multiple packages might have the same dependencies, which would then be installed multiple times; even worse, they might have conflicting dependencies which pip wouldn't know about in this case.

@pfmoore
Member

pfmoore commented Jun 5, 2019

multiple packages might have the same dependencies, which would then be installed multiple times; even worse, they might have conflicting dependencies which pip wouldn't know about in this case.

Not if you pip install all the wheels in one command once they are built.

@samuelcolvin

Good point, sorry.

For me that reduced the install time in the above example from 134s to 54s.

@pradyunsg
Member

@pradyunsg do you consider this a separate concern? If so, I'll create a new issue.

Yep. Even if we're going to solve them together, it's useful to have separation of concerns and discussion.

  • where can the maximum speedup be achieved for the maximum number of people? download or compile

  • is parallel compiling closely enough related to parallel download to be implemented together? (and therefore tracked under the same issue)

  • would it be easier/more productive to build a new tool for fast installation that doesn't have the same limitations as pip? So you'd effectively do pip install fast-pip; fast-pip install -r requirements.txt.... This might make progress quicker but would leave those still using pip in the slow lane

I don't know. Someone has to do the work of figuring out what the scope of things is and actually spend time implementing this stuff.

Since no one has stepped up to do it yet, and the volunteered time from the maintainers isn't going to be all pulled into this issue, I don't think that we'll have the answers soon either.

@thedrow

thedrow commented Jun 6, 2019

I have a PR which works and shortens installation time.
The only thing I haven't figured out is the UX/UI.
If someone can figure that out, we have something that kind of works (there's no SAT resolver which can tell us which packages we should download, though).

@pradyunsg
Member

@thedrow please feel free to file a PR and we can have more concrete discussions on it. :)

@thedrow

thedrow commented Jun 8, 2019

@pradyunsg I did a while ago...
See #3981 and #4654.

@jsar3004

jsar3004 commented Jul 2, 2020

I am a beginner to open source. What technologies should I learn so that I can work on this project?

@McSinyx
Contributor

McSinyx commented Jul 2, 2020

Hi @jsar3004, I'm working on this with pip's maintainers and contributors this summer. It turns out this is not a trivial task, as it touches many corners that I never thought it would. FYI, the current work is implementing GH-7819; if you want to help out, feel free to join the discussion.

If you are really new to contributing to free software development, it's more than just technical skills and understanding that you should possess. For a medium-to-large project like pip, a decent amount of time is actually spent in discussion at all levels (from ideas to nitpicking); trust me, I'm still learning how to do it properly. Personally, I suggest starting from the list of issues awaiting PRs and seeing what you're also interested in. Widening the scope a bit, I believe other projects under the PyPA umbrella also welcome people looking to contribute.


@pradyunsg
Member

I am a beginner to open source. What technologies should I learn so that I can work on this project?

@jsar3004 Hi! I suggest you get started by looking at other tasks, such as good first issues in pip and other PyPA tooling (search: org:pypa label:"good first issue"). This is a significantly large task with multiple moving parts -- it would not be possible to work on this without an initial ramp-up, gaining a better understanding of how pip works and what pip's development workflows are.

@cosmicexplorer
Contributor

Hey, I've created a PR #12923 to do this by extending the BatchDownloader already developed by @McSinyx. It's not quite ready for primetime yet, as I'm not sure how to support the progress bar. The BatchDownloader interface (producing an iterable of results) made it super easy to install in parallel with downloads as requested, but the parallelism will be limited to "metadata-only" dists, which require the rest of the work in #12921 to be available for the majority of pip download targets.

Please ask questions/comments in #12921 regarding metadata resolves, and #12923 regarding the parallel download strategy (I just used threads instead of any fancy async stuff). I haven't figured out the progress bar yet (although I'm going to try to do so now), so please feel free to leave comments in #12923 regarding how to support a unified progress bar for parallel downloads.
