
Using Large Pages (2M) in Node for Performance #16198

Closed
suresh-srinivas opened this issue Oct 14, 2017 · 18 comments
Labels
performance Issues and PRs related to the performance of Node.js.

Comments

@suresh-srinivas
Contributor

suresh-srinivas commented Oct 14, 2017

Across a couple of workloads (Node-DC-EIS and Ghost) I noticed that practically all the page walks are for 4K pages.

Here is a specific example from Node-DC-EIS (normalized per transaction) on a Xeon Platinum 8180 server.

ITLB_MISSES.WALK_COMPLETED 6,872.3739
ITLB_MISSES.WALK_COMPLETED_2M_4M 2.3691
ITLB_MISSES.WALK_COMPLETED_4K 6,869.9723

Using the TMAM methodology, this works out to about 16% of cycles stalled in the CPU front end performing page walks.

Several runtimes (Java JVM, PHP, HHVM) have support for large pages. They allocate the hot static code segments and/or dynamic JIT code segments in large 2M pages. This typically gives a performance improvement of several percent, depending on how many cycles are stalled on page walks.

I wanted to have a discussion about what the community thinks of this. I would also be interested in seeing some more data from other workloads. The following perf commands are an easy way to get this data for your workload.

perf stat -e cpu/event=0x85,umask=0xe,name=itlb_misses_walk_completed/ -- sleep 30
perf stat -e cpu/event=0x85,umask=0x2,name=itlb_misses_walk_completed_4k/ -- sleep 30
perf stat -e cpu/event=0x85,umask=0x4,name=itlb_misses_walk_completed_2m_4m/ -- sleep 30

A simple implementation would start with mapping all the .text segment code into large pages (this would be about 20 lines of code on Linux), and it would work reasonably well on modern CPUs. On older CPUs (such as Sandy Bridge), which have only a single-level 2M TLB, this is not efficient, and a better implementation would map only the hot part of the .text segment to large pages.

@joyeecheung joyeecheung added the performance Issues and PRs related to the performance of Node.js. label Oct 14, 2017
@joyeecheung
Member

Possibly related: #11077 (fragmentation in large pages)

@suresh-srinivas
Contributor Author

@joyeecheung this would not be transparent huge pages but explicit mapping using mmap with MAP_HUGETLB.

@joyeecheung
Member

joyeecheung commented Oct 14, 2017

@suresh-srinivas Yes, but if V8 integrated better with large pages (using mmap with MAP_HUGETLB and managing larger pages), then that issue could potentially be alleviated? (See #11077 (comment))

@bnoordhuis
Member

I don't think there is anything actionable right now. Node.js doesn't mmap memory itself, that's done by V8 and glibc on behalf of node.js.

Node.js could mmap some memory directly. For allocations that are released again on the same thread, that would be a win (our ArrayBufferAllocator in particular), as it would avoid the overhead of regular malloc/free.

(Having said that: I experimented with that approach a few years ago and the results were inconclusive. YMMV, benchmark carefully.)

With regard to V8, it is currently hard-coded to allocate in multiples of the page size up to 512 kB (that limit applies to executable memory in particular.) Quite a bit of work would have to be done to remove the limit and I'm not sure if it would be well-received because of the security implications that were mentioned in #11077.

@suresh-srinivas
Contributor Author

@joyeecheung yes, V8 could use large pages for its JITted code pages. What I was thinking about was the node binary and all the DSOs it links against.

@suresh-srinivas
Contributor Author

suresh-srinivas commented Oct 17, 2017

@bnoordhuis thanks. This issue is about remapping the .text pages to use 2M pages; it looks like you are talking about data (malloc'd) pages.
Yes, it is actionable by:
a) Adding some code after node startup to remap the pages listed in /proc/<pid>/maps into 2M pages.
b) Alternatively, if we allow a dependency on another library, we could use libhugetlbfs and relink the node binary; that library will take care of mapping the .text and .bss segments into 2M pages.

@gireeshpunathil
Member

@suresh-srinivas - that looks interesting to me, and there is no harm in experimenting to see how it goes. These are r-xp sections that are rarely de-allocated or unmapped from the process, so could their presence in large pages reduce page misses by a large margin?

Here is what the map looks like on a freshly booted node; as you can see, node itself is the predominant code, followed by libstdc++. Do you propose to remap node's own pages alone, or everything?

00400000-020ee000 r-xp 00000000 00:13 63963947                           node
34d0000000-34d0020000 r-xp 00000000 08:06 125                            /lib64/ld-2.12.so
34d021f000-34d0220000 r--p 0001f000 08:06 125                            /lib64/ld-2.12.so
34d0220000-34d0221000 rw-p 00020000 08:06 125                            /lib64/ld-2.12.so
34d0400000-34d058a000 r-xp 00000000 08:06 196                            /lib64/libc-2.12.so
34d058a000-34d078a000 ---p 0018a000 08:06 196                            /lib64/libc-2.12.so
34d078a000-34d078e000 r--p 0018a000 08:06 196                            /lib64/libc-2.12.so
34d078e000-34d078f000 rw-p 0018e000 08:06 196                            /lib64/libc-2.12.so
34d0800000-34d0817000 r-xp 00000000 08:06 304                            /lib64/libpthread-2.12.so
34d0817000-34d0a17000 ---p 00017000 08:06 304                            /lib64/libpthread-2.12.so
34d0a17000-34d0a18000 r--p 00017000 08:06 304                            /lib64/libpthread-2.12.so
34d0a18000-34d0a19000 rw-p 00018000 08:06 304                            /lib64/libpthread-2.12.so
34d0c00000-34d0c02000 r-xp 00000000 08:06 387                            /lib64/libdl-2.12.so
34d0c02000-34d0e02000 ---p 00002000 08:06 387                            /lib64/libdl-2.12.so
34d0e02000-34d0e03000 r--p 00002000 08:06 387                            /lib64/libdl-2.12.so
34d0e03000-34d0e04000 rw-p 00003000 08:06 387                            /lib64/libdl-2.12.so
34d1000000-34d1007000 r-xp 00000000 08:06 341                            /lib64/librt-2.12.so
34d1007000-34d1206000 ---p 00007000 08:06 341                            /lib64/librt-2.12.so
34d1206000-34d1207000 r--p 00006000 08:06 341                            /lib64/librt-2.12.so
34d1207000-34d1208000 rw-p 00007000 08:06 341                            /lib64/librt-2.12.so
34d1800000-34d1883000 r-xp 00000000 08:06 368                            /lib64/libm-2.12.so
34d1883000-34d1a82000 ---p 00083000 08:06 368                            /lib64/libm-2.12.so
34d1a82000-34d1a83000 r--p 00082000 08:06 368                            /lib64/libm-2.12.so
34d1a83000-34d1a84000 rw-p 00083000 08:06 368                            /lib64/libm-2.12.so
3613e00000-3613e16000 r-xp 00000000 08:06 538                            /lib64/libgcc_s-4.4.7-20120601.so.1
3613e16000-3614015000 ---p 00016000 08:06 538                            /lib64/libgcc_s-4.4.7-20120601.so.1
3614015000-3614016000 rw-p 00015000 08:06 538                            /lib64/libgcc_s-4.4.7-20120601.so.1
3614200000-36142e8000 r-xp 00000000 08:02 262942                         /usr/lib64/libstdc++.so.6.0.13
36142e8000-36144e8000 ---p 000e8000 08:02 262942                         /usr/lib64/libstdc++.so.6.0.13
36144e8000-36144ef000 r--p 000e8000 08:02 262942                         /usr/lib64/libstdc++.so.6.0.13
36144ef000-36144f1000 rw-p 000ef000 08:02 262942                         /usr/lib64/libstdc++.so.6.0.13

Do you have a PoC code that we can try integrating and running against some benchmarks?

@suresh-srinivas
Contributor Author

@gireeshpunathil thanks. I was initially planning to remap node's own .text segment. Yes, I will send a PR out when I have it completed.

@jasnell
Member

jasnell commented Oct 27, 2017

I'm +1 on at least having a PR we can use to experiment with the impact of this. It certainly makes sense; we would just need some solid benchmark results to show it is worth the effort.

@suresh-srinivas
Contributor Author

I have an initial prototype working (with libhugetlbfs). I am seeing a good reduction in ITLB misses and a performance improvement (6%) for one workload I have tested, react-server-side-rendering. I will work with @uttampawar and @mhdawson to run a few more benchmarks from the Node Benchmarking WG. I will need help from someone on the build team to integrate with libhugetlbfs; currently I am building and installing libhugetlbfs and modifying the final g++ link line in node. If this integration is not desired, I will have to manually write code to do the mmap.

@bnoordhuis
Member

Not desired, it's LGPL (and it's not something we'd want as an external dependency.)

@suresh-srinivas
Contributor Author

suresh-srinivas commented Mar 16, 2018

Sorry this took so long. I got a chunk of time when it was snowing here in Portland, and I have completed a Linux implementation that programmatically maps a subset of the Node.js text segment to 2M pages. It demonstrates a 4-5% performance improvement on the React-SSR workload, along with a reduction in ITLB misses and the front-end bottleneck. More work is needed to programmatically choose between explicit huge pages and anonymous (transparent) huge pages, and to check whether a sufficient number are available.

@Trott @addaleax I just read your Medium article; thanks for all the work you do to help contributors to the node project. I could use a mentor or two to help get this in. @uttampawar has kindly code-reviewed it and I incorporated most of his suggestions.

@gireeshpunathil I now have a PoC, so if you want to try it out, let me know. @joyeecheung let me know if you want to try it as well.

The algorithm is quite simple but the implementation was a bit tricky!

  1. Find the text region of node in memory.
    We read the maps file and find the start and end addresses of the loaded node process.
    Within that start and end address, the .text region is what we are interested in.
    We modify the linker script to PROVIDE(__nodetext), which points to this region. The linker already provides __etext, which is the end of the text segment.
  2. Move the text region to large pages.
    We need to be very careful: the function that does the move must not itself be moved!
    We use a gcc option to place it outside the .text area.
    This function must not call any functions that might be moved (particularly through the PLT, which is also in the .text segment but before __nodetext).
    1. We map a new area and copy the original code there.
    2. We mmap using MAP_HUGETLB.
    3. If we are successful, we copy the code there and unmap the original region.

@gireeshpunathil
Member

@suresh-srinivas - thanks. Please raise a PR if it is finalized and ready for review, or a PR with a [WIP] tag if it is in a half-baked state (for which design-level changes are anticipated).

@gireeshpunathil
Member

ping @suresh-srinivas

@suresh-srinivas
Contributor Author

@gireeshpunathil thanks for checking in. The development is complete and the code is ready to send as a PR. @uttampawar and I are measuring performance; we should be done by the weekend.

@suresh-srinivas
Contributor Author

@joyeecheung @jasnell @gireeshpunathil @Trott @addaleax
We sent in the pull request today. Could you take a look?

@mmarchini
Contributor

@suresh-srinivas should this remain open or was it addressed by #22079?

@suresh-srinivas
Contributor Author

> @suresh-srinivas should this remain open or was it addressed by #22079?

Yes, this is addressed by PR #22079.
