
ENH: Enable huge pages in all Linux builds #14216

Merged
charris merged 3 commits into numpy:master on Aug 20, 2019

Conversation

pentschev
Contributor

This PR modifies memory allocation so that huge pages for large arrays are enabled in all Linux builds (even if MADV_HUGEPAGE is not defined at build time). As per the discussion in #14177, this isn't currently the case, which means huge pages may never be available depending on the NumPy build, even if the runtime system supports them. Both the conda-forge and pip builds of 1.17.0 appear not to include support for huge pages.

The madvise() man page indicates that the call returns EINVAL if the advice is invalid. At runtime this means huge pages are enabled when they are available; otherwise, for NumPy's use case, it's enough to simply ignore the error and fall back to the default behavior.
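As a minimal sketch of that pattern (the helper name here is hypothetical; NumPy's actual call site lives in the allocator diff shown below):

#include <stddef.h>
#include <sys/mman.h>

/*
 * Optimistically request transparent huge pages for a large, page-aligned
 * region. The return value is deliberately ignored: on kernels without
 * THP support madvise() fails with EINVAL and the allocation simply keeps
 * the default 4 KiB pages.
 */
static void
advise_hugepage(void *p, size_t length)
{
#ifdef MADV_HUGEPAGE
    (void)madvise(p, length, MADV_HUGEPAGE);
#endif
}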

Fixes #14177.

cc @hmaarrfk @jakirkham @charris @seberg @mrocklin

// Use code 14 (MADV_HUGEPAGE) if it isn't defined. This gives
// a chance of enabling huge pages even if built with a Linux
// kernel < 2.6.38
madvise((void*)((npy_uintp)p + offset), length, 14);
Contributor

Why not just define MADV_HUGEPAGE and let it go through a more similar code path?
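A sketch of the suggested pattern (14 is the Linux value of MADV_HUGEPAGE; exact placement in the final change may differ):

/*
 * If the build headers are too old to define MADV_HUGEPAGE, supply the
 * Linux value ourselves so the call site needs no special casing. The
 * kernel that actually runs the code still decides whether to honor it.
 */
#ifndef MADV_HUGEPAGE
#define MADV_HUGEPAGE 14
#endif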

Contributor Author

Yes, thanks for the good idea, let me fix that.

Contributor Author

Done in the latest commits, let me know if that's ok with you.

Contributor

ok with me means nothing, just a suggestion ;)

Contributor Author

Well, if it's ok with you, it means you probably neither disagree nor have an idea to make it even cleaner. :)

@@ -76,14 +83,7 @@ _npy_alloc_cache(npy_uintp nelem, npy_uintp esz, npy_uint msz,
if (NPY_UNLIKELY(nelem * esz >= ((1u<<22u)))) {
npy_uintp offset = 4096u - (npy_uintp)p % (4096u);
npy_uintp length = nelem * esz - offset;
#ifdef MADV_HUGEPAGE
madvise((void*)((npy_uintp)p + offset), length, MADV_HUGEPAGE);
Contributor

I think you should add a comment here that we explicitly do not check the return code of madvise, because we want to optimistically ask for HUGEPAGE even in the case that an old kernel is being used.
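Something along these lines, as a sketch (the wording in the merged commit may differ):

/*
 * Intentionally not checking the return code: on kernels that predate
 * transparent huge pages madvise() fails with EINVAL, and falling back
 * to regular 4 KiB pages is exactly the behavior we want.
 */
madvise((void*)((npy_uintp)p + offset), length, MADV_HUGEPAGE);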

Contributor Author

Agreed. I added that comment in the latest commit.

charris changed the title from "ENH, REL: Enable huge pages in all Linux builds" to "ENH: Enable huge pages in all Linux builds" on Aug 7, 2019
@hmaarrfk
Contributor

Is there a test we can run to see if this is actually working? I'm trying to build numpy on my machine and it is unclear if madvise is having an effect.

@pentschev
Contributor Author

There are only two ways I found to test this: adding a printf to the code, or comparing the allocation time against the pip package. That's why I didn't add a test in this PR; I don't really know how to ensure madvise actually gets called.

@@ -25,10 +25,14 @@

#include <assert.h>

#ifdef HAVE_SYS_MMAN_H
#ifdef NPY_OS_LINUX
Member

Is NPY_OS_LINUX a true superset of HAVE_SYS_MMAN_H? Just curious.

Contributor Author

<sys/mman.h> dates back to Unix and I've confirmed that it exists at least since kernel 2.6.0 (possibly before), so I think it's a safe bet.

Member

Sorry, I mean that NPY_OS_SOLARIS or NPY_OS_DARWIN may currently satisfy HAVE_SYS_MMAN_H but are not covered under NPY_OS_LINUX.

Contributor Author

That's right, but we're only including <sys/mman.h> for MADV_HUGEPAGE, and, AFAIK, that's a feature only available on Linux.
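In other words, the guard only needs to cover the platform where the advice exists; a sketch of the include logic under discussion:

/*
 * <sys/mman.h> is only needed for madvise()/MADV_HUGEPAGE, and
 * MADV_HUGEPAGE is Linux-specific, so guard on the OS rather than on
 * header availability.
 */
#ifdef NPY_OS_LINUX
#include <sys/mman.h>
#endif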

@hmaarrfk
Contributor

Is there something we can do to test with hugeadm?

@pentschev
Contributor Author

I tried checking hugeadm --pool-list, but I didn't see anything changing before/after allocation, and the other options don't seem to provide any information that would be useful here. What I observed is that changes to AnonHugePages in /proc/meminfo match NumPy allocations with the changes in this PR, whereas the counter remains unchanged with the NumPy 1.17.0 pip package.
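A standalone check along those lines might look like this (a sketch, assuming a Linux system with transparent huge pages enabled; the 1 GiB size is arbitrary, and the counter can also move for unrelated reasons):

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

/* Read the AnonHugePages counter (in kB) from /proc/meminfo. */
static long
anon_huge_kb(void)
{
    char line[256];
    long kb = -1;
    FILE *f = fopen("/proc/meminfo", "r");
    if (f == NULL)
        return -1;
    while (fgets(line, sizeof line, f)) {
        if (sscanf(line, "AnonHugePages: %ld kB", &kb) == 1)
            break;
    }
    fclose(f);
    return kb;
}

int
main(void)
{
    size_t length = (size_t)1 << 30;  /* 1 GiB, well above the threshold */
    long before = anon_huge_kb();
    char *p = mmap(NULL, length, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return 1;
    madvise(p, length, MADV_HUGEPAGE);  /* return value ignored, as in the PR */
    memset(p, 1, length);               /* fault the pages in */
    long after = anon_huge_kb();
    printf("AnonHugePages: %ld kB -> %ld kB\n", before, after);
    munmap(p, length);
    return 0;
}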

@hmaarrfk
Contributor

The increased usage of AnonHugePages is what I understood would happen too. Apparently the address space can get fragmented after much use, so there are warnings that one should use pools for better allocation. I guess I'll have to look into those later.

Thanks for working through this.

@pentschev
Contributor Author

I think fragmentation will happen anyway for sufficiently large memory objects. I'm not sure I follow why that is relevant for a test, though. The only point I was making was that an alternative (probably not a great one) is to check that the array actually gets flagged as hugepage, which we could do by reading /proc/meminfo.

@charris
Member

charris commented Aug 16, 2019

@pentschev Do you feel happy with this? I'm a bit wary about making the backport, but if this is unlikely to break anything -- and I'm not sure how we find out without a release -- I'll do so.

@pentschev
Contributor Author

It worked well in my tests @charris, but that's really one setup among many that could exist. I also think it's unlikely to break anything; in the event that it does, I think it will be caught very easily, since all memory allocation will then be broken.

If this could be in the next release, I would be very happy about that! :)

charris removed the "09 - Backport-Candidate" (PRs tagged should be backported) label on Aug 17, 2019
@charris
Member

charris commented Aug 17, 2019

I removed the backport label, as it makes me a bit more comfortable to have the rc testing period in 1.18. If no one complains, I'll put this in soon.

@pentschev
Contributor Author

I would certainly prefer to have this backported at least to 1.17. We have some use cases where this gives us a major performance boost that we otherwise won't see for the next 5-6 months, until 1.18 is released.

charris added the "09 - Backport-Candidate" (PRs tagged should be backported) label on Aug 20, 2019
charris added this to the 1.17.1 release milestone on Aug 20, 2019
@charris
Member

charris commented Aug 20, 2019

I'm always willing to gamble and relearn why it is a bad idea. But this looks pretty safe, so let's give it a try in 1.17.

charris merged commit 950dd4e into numpy:master on Aug 20, 2019
/*
* Use code 14 (MADV_HUGEPAGE) if it isn't defined. This gives a chance of
* enabling huge pages even if built with linux kernel < 2.6.38
*/
Member

Just to clarify, this means that even if compiled on a machine running an old kernel, it will enable huge pages when run on a newer kernel?

Contributor Author

That's correct. An attempt to enable huge pages will be made, and if it errors out, the advice is simply ignored.

@charris
Member

charris commented Aug 20, 2019

@pentschev Have you tested this running on old kernels?

@charris
Member

charris commented Aug 20, 2019

Checking... CentOS 5.11, used for building wheels, had kernel-2.6.18, so I guess we will know if this works soon enough. @pentschev There should be a nightly wheel built sometime in the next 24 hours that has this in it, so we can test it. The wheels can be found at https://7933911d6844c6c53a7d-47bd50c35cd79bd838daf386af554a83.ssl.cf2.rackcdn.com/

charris removed the "09 - Backport-Candidate" label on Aug 20, 2019
charris removed this from the 1.17.1 release milestone on Aug 20, 2019
@pentschev
Contributor Author

> @pentschev Have you tested this running on old kernels?

I haven't tested that, but according to kernel documentation, simply ignoring the error should still give us the old behavior.

> Checking... CentOS 5.11, used for building wheels, had kernel-2.6.18, so I guess we will know if this works soon enough.

That's great, please let me know how it goes. Should anything go wrong I can work on a fix for that.

> There should be a nightly wheel built sometime in the next 24 hours that has this in it, so we can test it. The wheels can be found at https://7933911d6844c6c53a7d-47bd50c35cd79bd838daf386af554a83.ssl.cf2.rackcdn.com/

Just checked, still not available but will watch for that and try it out once it's up. Thanks for the link @charris!

@charris
Member

charris commented Aug 20, 2019

I went ahead and triggered a wheels build for master. How long it takes varies; you can check progress at https://travis-ci.org/MacPython/numpy-wheels.

@charris
Member

charris commented Aug 21, 2019

@pentschev New wheels are up.

@pentschev
Contributor Author

@charris I see NumPy 1.18.0.dev0 wheels; were we supposed to have a 1.17.1 as well?

Either way, it works well on newer kernels at least; I for one am using 4.15.0-58 (Ubuntu 16.04) for the results below:

In [1]: import numpy as np

In [2]: np.__version__
Out[2]: '1.17.0'

In [3]: %%time
   ...: a = np.ones(int(1024**3), dtype='u1')
   ...:
   ...:
CPU times: user 131 ms, sys: 384 ms, total: 514 ms
Wall time: 512 ms

In [1]: import numpy as np

In [2]: np.__version__
Out[2]: '1.18.0.dev0+950dd4e'

In [3]: %%time
   ...: a = np.ones(int(1024**3), dtype='u1')
   ...:
   ...:
CPU times: user 24 ms, sys: 201 ms, total: 225 ms
Wall time: 223 ms

The second one is much faster, as expected. :)

@pentschev
Contributor Author

I used the Python 3.7 package for the results in my previous comment.

@charris
Member

charris commented Aug 21, 2019

I only triggered the master build, so dev0 is correct. Looks like it is working as advertised, so I'll go ahead with the backport. Thanks for testing.

@charris
Member

charris commented Aug 21, 2019

I went ahead and backported to 1.16.x as well, can't hurt.

@Plantain

> The increased usage of AnonHugePages is what I understood would happen too. Apparently the address space can get fragmented after much use, so there are warnings that one should use pools for better allocation. I guess I'll have to look into those later.

> I'm always willing to gamble and relearn why it is a bad idea. But this looks pretty safe, so let's give it a try in 1.17.

Was there ever any follow-up on the fragmentation? After a few days of digging, I've found this patch absolutely tanks performance on our workloads.
We do quite heavy computation with NumPy (manipulating ~1 GB arrays). When we run one process, this patchset makes no difference; when I run 30 in parallel (on a machine with 128 cores and 400 GB RAM), the address space gets fragmented faster than the kernel can defragment it, and performance drops to ~5% of single-threaded performance, barely faster than processing serially.

Toggling hugepages off clears the issue entirely, and the processes run in parallel with almost linear scaling.
Similarly, setting /sys/kernel/mm/transparent_hugepage/defrag to "defer" restores the linear scaling.
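For reference, that knob is a plain sysfs file, so the workaround can be scripted; a minimal sketch (system-wide setting, needs root; path per the kernel THP documentation):

#include <stdio.h>

/*
 * Switch transparent-huge-page defragmentation to deferred mode so page
 * faults never stall waiting for synchronous compaction.
 */
int
main(void)
{
    FILE *f = fopen("/sys/kernel/mm/transparent_hugepage/defrag", "w");
    if (f == NULL) {
        perror("fopen");
        return 1;
    }
    fputs("defer", f);
    fclose(f);
    return 0;
}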

@charris
Member

charris commented Oct 23, 2020

@Plantain My understanding was that there were early problems with fragmentation that were fixed in later Linux kernels. However, it is certainly possible that your setup exposes new problems. What kernel version do you have?

@Plantain

Plantain commented Oct 23, 2020 via email

@charris
Member

charris commented Oct 23, 2020

That is pretty recent... Googling around, there seems to be a fair amount of ongoing work on the fragmentation problem, as well as various ways to disable hugepages that may be distro-dependent.

@seberg
Member

seberg commented Oct 23, 2020

@Plantain it seemed mainly an issue on old kernels, so we only rolled it back on those (but only in 1.19, IIRC). We also have an environment variable to disable madvise hugepages: https://numpy.org/doc/1.19/reference/global_state.html?highlight=global

That said, I was personally never quite sure what the tradeoffs are here.

@pentschev
Contributor Author

I think this will depend a lot on OS configuration, and we'll probably not find a one-size-fits-all solution, unfortunately. Variables such as NUMPY_MADVISE_HUGEPAGE still seem to me like the right approach from the NumPy side; since fragmentation can be a problem, maybe it would be good to document that.
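As a sketch of what such a gate looks like on the C side (the helper is hypothetical; the real logic and the exact semantics of NUMPY_MADVISE_HUGEPAGE live in NumPy's allocator and the linked docs):

#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: honor an opt-out environment variable before
 * issuing madvise(MADV_HUGEPAGE). Unset means enabled by default.
 */
static int
madvise_hugepage_enabled(void)
{
    const char *env = getenv("NUMPY_MADVISE_HUGEPAGE");
    if (env == NULL)
        return 1;
    return strcmp(env, "0") != 0;
}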

rapids-bot bot pushed a commit to rapidsai/cudf that referenced this pull request Jan 11, 2023
…r spilling (#12399)

A `bytearray` is zero-initialized using `calloc`, which we don't need.  Additionally, `numpy.empty` both skips the zero-initialization and uses hugepages when available <numpy/numpy#14216>.

Authors:
  - Mads R. B. Kristensen (https://github.com/madsbk)
  - Vyas Ramasubramani (https://github.com/vyasr)
  - https://github.com/jakirkham

Approvers:
  - Vyas Ramasubramani (https://github.com/vyasr)
  - https://github.com/jakirkham

URL: #12399

Successfully merging this pull request may close these issues.

Improve large array allocation speed
5 participants