
Using Large Pages (2M) in Node for Performance #16198

Closed
suresh-srinivas opened this issue Oct 14, 2017 · 18 comments
Labels
performance Issues and PRs related to the performance of Node.js.

Comments

@suresh-srinivas
Contributor

suresh-srinivas commented Oct 14, 2017

Across a couple of workloads (Node-DC-EIS and Ghost) I noticed that practically all the page walks are for 4K pages.

Here is a specific example from Node-DC-EIS (normalized per transaction) on a Xeon Platinum 8180 server.

ITLB_MISSES.WALK_COMPLETED 6,872.3739
ITLB_MISSES.WALK_COMPLETED_2M_4M 2.3691
ITLB_MISSES.WALK_COMPLETED_4K 6,869.9723

Using the TMAM methodology, this works out to about 16% of cycles stalled in the CPU front end performing page walks.

Several runtimes (Java JVM, PHP, HHVM) have support for large pages. They allocate the hot static code segments and/or dynamic JIT code segments in large 2M pages. This typically gives a performance improvement of several percent, depending on how many cycles are stalled on page walks.

I wanted to have a discussion about what the community thinks of this. I would also be interested in seeing some more data from other workloads. The following perf commands are an easy way to get this data for your workload.

perf stat -e cpu/event=0x85,umask=0xe,name=itlb_misses_walk_completed/ -- sleep 30
perf stat -e cpu/event=0x85,umask=0x2,name=itlb_misses_walk_completed_4k/ -- sleep 30
perf stat -e cpu/event=0x85,umask=0x4,name=itlb_misses_walk_completed_2m_4m/ -- sleep 30

A simple implementation would start with mapping all the .text segment code into large pages (this would be about 20 lines of code on Linux), and it would work reasonably well on modern CPUs. On older CPUs (such as Sandy Bridge), which have only a single-level 2M TLB, this is not efficient, and a better implementation would map only the hot part of the .text segment to large pages.

@joyeecheung joyeecheung added the performance Issues and PRs related to the performance of Node.js. label Oct 14, 2017
@joyeecheung
Member

Possibly related: #11077 (fragmentation in large pages)

@suresh-srinivas
Contributor Author

@joyeecheung this would not be transparent huge pages but explicit mapping using mmap with MAP_HUGETLB.

@joyeecheung
Member

joyeecheung commented Oct 14, 2017

@suresh-srinivas Yes, but if V8 integrated better with large pages (using mmap with MAP_HUGETLB and managing larger pages), then that issue could potentially be alleviated? (See #11077 (comment))

@bnoordhuis
Member

I don't think there is anything actionable right now. Node.js doesn't mmap memory itself, that's done by V8 and glibc on behalf of node.js.

Node.js could mmap some memory directly. For allocations that are released again on the same thread, that would be a win (our ArrayBufferAllocator in particular), as it would avoid the overhead of regular malloc/free.

(Having said that: I experimented with that approach a few years ago and the results were inconclusive. YMMV, benchmark carefully.)

With regard to V8, it is currently hard-coded to allocate in multiples of the page size up to 512 kB (that limit applies to executable memory in particular.) Quite a bit of work would have to be done to remove the limit and I'm not sure if it would be well-received because of the security implications that were mentioned in #11077.

@suresh-srinivas
Contributor Author

@joyeecheung yes, V8 could use large pages for its JITted code pages. What I was thinking about was the node binary and all the DSOs it links against.

@suresh-srinivas
Contributor Author

suresh-srinivas commented Oct 17, 2017

@bnoordhuis thanks. This issue is about remapping the .text pages to use 2M pages; it looks like you are talking about data (malloc'd) pages.
Yes, it is actionable by:
a) Adding some code after node startup to remap the pages listed in /proc/<pid>/maps into 2M pages.
b) Alternatively, if we allow a dependency on another library, we could use libhugetlbfs and relink the node binary; that library will take care of mapping the .text and .bss segments into 2M pages.

@gireeshpunathil
Member

@suresh-srinivas - that looks interesting to me, and there is no harm in experimenting to see how it goes. These are r-xp sections that are rarely de-allocated or unmapped from the process, so could their presence in large pages reduce page misses by a large margin?

Here is what the map looks like on a freshly booted node; as you can see, node itself is the predominant code, followed by libstdc++. Do you propose to remap node's own pages alone, or everything?

00400000-020ee000 r-xp 00000000 00:13 63963947                           node
34d0000000-34d0020000 r-xp 00000000 08:06 125                            /lib64/ld-2.12.so
34d021f000-34d0220000 r--p 0001f000 08:06 125                            /lib64/ld-2.12.so
34d0220000-34d0221000 rw-p 00020000 08:06 125                            /lib64/ld-2.12.so
34d0400000-34d058a000 r-xp 00000000 08:06 196                            /lib64/libc-2.12.so
34d058a000-34d078a000 ---p 0018a000 08:06 196                            /lib64/libc-2.12.so
34d078a000-34d078e000 r--p 0018a000 08:06 196                            /lib64/libc-2.12.so
34d078e000-34d078f000 rw-p 0018e000 08:06 196                            /lib64/libc-2.12.so
34d0800000-34d0817000 r-xp 00000000 08:06 304                            /lib64/libpthread-2.12.so
34d0817000-34d0a17000 ---p 00017000 08:06 304                            /lib64/libpthread-2.12.so
34d0a17000-34d0a18000 r--p 00017000 08:06 304                            /lib64/libpthread-2.12.so
34d0a18000-34d0a19000 rw-p 00018000 08:06 304                            /lib64/libpthread-2.12.so
34d0c00000-34d0c02000 r-xp 00000000 08:06 387                            /lib64/libdl-2.12.so
34d0c02000-34d0e02000 ---p 00002000 08:06 387                            /lib64/libdl-2.12.so
34d0e02000-34d0e03000 r--p 00002000 08:06 387                            /lib64/libdl-2.12.so
34d0e03000-34d0e04000 rw-p 00003000 08:06 387                            /lib64/libdl-2.12.so
34d1000000-34d1007000 r-xp 00000000 08:06 341                            /lib64/librt-2.12.so
34d1007000-34d1206000 ---p 00007000 08:06 341                            /lib64/librt-2.12.so
34d1206000-34d1207000 r--p 00006000 08:06 341                            /lib64/librt-2.12.so
34d1207000-34d1208000 rw-p 00007000 08:06 341                            /lib64/librt-2.12.so
34d1800000-34d1883000 r-xp 00000000 08:06 368                            /lib64/libm-2.12.so
34d1883000-34d1a82000 ---p 00083000 08:06 368                            /lib64/libm-2.12.so
34d1a82000-34d1a83000 r--p 00082000 08:06 368                            /lib64/libm-2.12.so
34d1a83000-34d1a84000 rw-p 00083000 08:06 368                            /lib64/libm-2.12.so
3613e00000-3613e16000 r-xp 00000000 08:06 538                            /lib64/libgcc_s-4.4.7-20120601.so.1
3613e16000-3614015000 ---p 00016000 08:06 538                            /lib64/libgcc_s-4.4.7-20120601.so.1
3614015000-3614016000 rw-p 00015000 08:06 538                            /lib64/libgcc_s-4.4.7-20120601.so.1
3614200000-36142e8000 r-xp 00000000 08:02 262942                         /usr/lib64/libstdc++.so.6.0.13
36142e8000-36144e8000 ---p 000e8000 08:02 262942                         /usr/lib64/libstdc++.so.6.0.13
36144e8000-36144ef000 r--p 000e8000 08:02 262942                         /usr/lib64/libstdc++.so.6.0.13
36144ef000-36144f1000 rw-p 000ef000 08:02 262942                         /usr/lib64/libstdc++.so.6.0.13

Do you have a PoC code that we can try integrating and running against some benchmarks?

@suresh-srinivas
Contributor Author

@gireeshpunathil thanks. I was initially planning to remap node's own .text segment. Yes, I will send a PR out when I have it completed.

@jasnell
Member

jasnell commented Oct 27, 2017

I'm +1 on at least having a PR we can use to experiment with the impact of this. It certainly makes sense; we would just need some solid benchmark results to show it is worth the effort.

@suresh-srinivas
Contributor Author

I have an initial prototype working (with libhugetlbfs). I am seeing a good reduction in ITLB misses and a performance improvement (6%) for one workload I have tested, react-server-side-rendering. I will work with @uttampawar and @mhdawson to run a few more benchmarks from the Node Benchmarking WG. I will need help from someone on the build team to integrate with libhugetlbfs; currently I am building and installing libhugetlbfs and modifying the final g++ link line in node. If this integration is not desired, I will have to manually write code to do the mmap.

@bnoordhuis
Member

Not desired, it's LGPL (and it's not something we'd want as an external dependency.)

@suresh-srinivas
Contributor Author

suresh-srinivas commented Mar 16, 2018

Sorry this took so long. I got a chunk of time when it was snowing here in Portland, and I have completed a Linux implementation that programmatically maps a subset of the Node.js text segment to 2M pages. It demonstrates a 4-5% performance improvement on the React-SSR workload, along with a reduction in ITLB misses and the front-end bottleneck. More work is needed to programmatically choose between explicit huge pages and anonymous (transparent) huge pages, and to check whether a sufficient number are available.

@Trott @addaleax I just read your Medium article; thanks for all the work you do to help contributors to the node project. I could use a mentor or two to help get this in. @uttampawar has kindly code-reviewed it and I incorporated most of his suggestions.

@gireeshpunathil I now have a PoC, so if you want to try it out, let me know. @joyeecheung let me know if you want to try it as well.

The algorithm is quite simple but the implementation was a bit tricky!

  1. Find the text region of node in memory.
    We read the maps file and find the start and end addresses of the loaded node process.
    Within that start and end address, the .text region is what we are interested in.
    We modify the linker script to PROVIDE(__nodetext), which points to this region. The linker already provides __etext, which is the end of the text segment.
  2. Move the text region to large pages.
    We need to be very careful: the function that does the move must not itself be moved!
    We use a gcc option to place it outside the .text area.
    This function must not call any functions that might be moved (particularly through the PLT, which is also in the .text segment but before __nodetext).
    1. We map a new area and copy the original code there.
    2. We mmap using MAP_HUGETLB.
    3. If we are successful, we copy the code there and unmap the original region.

@gireeshpunathil
Member

@suresh-srinivas - thanks. Please raise a PR if it is finalized and ready for review, or a PR with a [WIP] tag if it is in a half-baked state (for which design-level changes are anticipated).

@gireeshpunathil
Member

ping @suresh-srinivas

@suresh-srinivas
Contributor Author

@gireeshpunathil thanks for checking in. The development is complete and the code is ready to send as a PR. @uttampawar and I are measuring performance; we should be done by the weekend.

@suresh-srinivas
Contributor Author

@joyeecheung @jasnell @gireeshpunathil @Trott @addaleax
We sent in the pull request today. Could you take a look?

@mmarchini
Contributor

@suresh-srinivas should this remain open or was it addressed by #22079?

@suresh-srinivas
Contributor Author

> @suresh-srinivas should this remain open or was it addressed by #22079?

Yes, this is addressed by PR #22079.
