Switch native GC policy to adaptive #28267

Closed
galderz opened this issue Sep 28, 2022 · 2 comments · Fixed by #28295
galderz commented Sep 28, 2022

Summary

I would like to propose that Quarkus Native switches to using the adaptive GC policy by default instead of space/time. Users can still fall back to the space/time GC policy by passing the following (note the need to escape the $ sign if passing it via the command line):

-Dquarkus.native.additional-build-args=-H:InitialCollectionPolicy=com.oracle.svm.core.genscavenge.CollectionPolicy\$BySpaceAndTime
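
For completeness, the same override can also be kept permanently in the application configuration rather than on the command line. A minimal sketch, assuming the standard quarkus.native.additional-build-args property in application.properties (the $ escaping shown above is for the shell; whether the properties file needs it depends on your config tooling):

    # application.properties — keep using the space/time policy after the default changes
    quarkus.native.additional-build-args=-H:InitialCollectionPolicy=com.oracle.svm.core.genscavenge.CollectionPolicy$BySpaceAndTime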

Adaptive is GraalVM CE’s default GC policy for serial GC, so to achieve this all we have to do is remove the Quarkus code that sets space/time as the GC policy.

Motivation

When no -Xmx is configured and the system has more than ~3GB of physical memory, space/time can show odd-looking memory consumption compared with the same workload running in a native executable that uses the adaptive GC policy and an equivalent memory configuration. This discrepancy was recently observed in testing carried out by a third party, which is the primary reason for the investigation that led to this issue.

Aside from epsilon GC, GraalVM CE contains a single usable GC algorithm, the serial GC (like the OpenJDK serial GC, it is single-threaded and non-concurrent, but it is a different implementation with different characteristics). The collection policy within it is configurable; this policy dictates when young-generation and old-generation collections happen.

The default policy is called “adaptive” and is based on HotSpot's ParallelGC adaptive size policy. The main difference from HotSpot is its focus on memory footprint: GraalVM’s adaptive GC policy aggressively triggers GCs in order to keep memory consumption down. Adaptive’s most important setting is GC_TIME_RATIO, although this constant, like the others in the policy, cannot currently be configured via the command line:

    /**
     * Ratio of mutator wall-clock time to GC wall-clock time. HotSpot's default is 99, i.e.
     * spending 1% of time in GC. We set it to 19, i.e. 5%, to prefer a small footprint.
     */
    private static final int GC_TIME_RATIO = 19;
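
To make the ratio concrete: a GC time ratio of N means roughly 1/(N + 1) of wall-clock time is spent in GC. A quick illustrative sketch of that arithmetic (not GraalVM code):

    // Illustration only: ratio = mutator wall-clock time / GC wall-clock time,
    // so the fraction of total time spent in GC is 1 / (ratio + 1).
    static double gcTimeFraction(int gcTimeRatio) {
        return 1.0 / (gcTimeRatio + 1);
    }
    // gcTimeFraction(99) -> 0.01 (~1% in GC, HotSpot's default)
    // gcTimeFraction(19) -> 0.05 (~5% in GC, GraalVM's adaptive policy)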

Back in 2018, Quarkus switched from the default GC policy to one called “space/time”. This is a collection policy that tries to delay full GCs until the heap has at least reached the minimum heap size, and then tries to balance time between young and full GCs.

One reason the switch to “space/time” happened is that the default GC policy (“adaptive”) was deemed to generate too many full GCs. One of the key differences between OpenJDK’s serial GC and GraalVM’s serial GC is precisely how these full GCs are executed. In OpenJDK the algorithm is mark-sweep-compact, whereas in GraalVM it is mark-copy. Both need to traverse all live objects, but in mark-copy this traversal is also used to copy live objects to a secondary space or semi-space. As objects are copied from one semi-space to another they are also compacted. In mark-sweep-compact, the compacting requires a second pass over the live objects. This makes full GCs in mark-copy more time-efficient (in terms of time spent in each GC cycle) than mark-sweep-compact. The tradeoff mark-copy makes to keep individual full GC cycles shorter is space: the use of semi-spaces means that for an application to maintain the same GC performance that mark-sweep-compact achieves (in terms of allocated MB per second), it requires double the amount of memory.

Focusing now on the GC policies, the runtime memory consumption of the space/time and adaptive GC policies doesn’t differ much, except when no maximum heap size (-Xmx) is passed to the native executable at runtime. In that situation, on systems with as little as ~3GB of physical memory, space/time chooses a minimum heap size of 512m, derived from 2 * the young space size (256m by default). Systems with less physical memory will choose smaller minimum heap sizes, but anything above ~3GB of physical memory is fixed to a 512m minimum heap size if no -Xmx is passed in.

So, a simple Quarkus plaintext load test of the server with no -Xmx will show memory consumption increasing until it reaches that 512m mark, followed by a full GC that clears most of the memory. By contrast, the adaptive policy will try to keep the footprint low and will consume less than half the memory that space/time does. This is the discrepancy observed in the third-party testing mentioned above.
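
For anyone wanting to reproduce that sawtooth pattern, GraalVM native executables accept runtime GC logging switches. A hedged example (the runner name is illustrative; the flags are the serial GC’s verbose logging options):

    ./myapp-runner -XX:+PrintGC -XX:+VerboseGC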

Once -Xmx is configured, memory consumption evens out. As an example, if we set -Xmx512m, space/time takes 10% of that as the young space and multiplies it by 2, giving a minimum heap size of roughly ~100m. With -Xmx values in this range, both adaptive and space/time end up consuming roughly the same amount of memory.
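
As a rough sketch of that derivation (illustrative arithmetic only, not the actual GraalVM code):

    // Illustration of the space/time minimum heap size derivation described above.
    long maxHeap   = 512L * 1024 * 1024;   // -Xmx512m
    long youngSize = maxHeap / 10;         // ~10% of max heap ≈ 51m
    long minHeap   = 2 * youngSize;        // ≈ 102m minimum heap size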

To understand the performance of adaptive versus space/time, I’ve carried out some tests on my local server with a laptop as the client (Quarkus commit 89fc74fdf0 from 13th September), a direct 1 Gbps switch connection, and Quarkus pinned to a 4-CPU NUMA node.

In my experiments with mid-range load workloads, adaptive seems to perform better:

Benchmark details: Reactive database, fixed phase throughput workload, with Quarkus assigned NUMA CPUs at ~50% usage. (Ideally I would run this workload at 80% CPU usage but I need to do further tuning on the db side to achieve this)

Compare space/time vs adaptive with -Xmx512m
============================================
Comparing runs 000B and 000C
PHASE                      METRIC        REQUESTS       MEAN                 p50                 p90                  p99                 p99.9               p99.99
rampup/get_zipcode         fetchIndex     +631(+0.53%)  +433.09 μs(+17.40%)   -53.25 μs(-5.88%)  -614.40 μs(-31.12%)  -262.14 μs(-2.86%)  +71.30 ms(+25.00%)  +52.43 ms(+15.06%)
rampup/post_zipcode        fetchIndex     +741(+0.62%)  +433.96 μs(+16.87%)   -53.25 μs(-5.14%)  -638.98 μs(-30.23%)   -65.54 μs(-0.71%)  +75.50 ms(+26.28%)  +60.82 ms(+17.16%)
main/get_zipcode           fetchIndex     +296(+0.05%)    -70.01 μs(-4.48%)   -57.34 μs(-5.15%)   -245.76 μs(-8.43%)   +1.28 ms(+17.03%)   -2.23 ms(-17.17%)   -5.96 ms(-36.84%)
main/get_zipcode_by_city   fetchDetails   -949(-0.16%)    -69.22 μs(-1.05%)   -65.54 μs(-1.07%)   -262.14 μs(-3.28%)  +524.29 μs(+4.04%)   -2.62 ms(-15.15%)   -5.77 ms(-27.85%)
main/post_zipcode          fetchIndex     +265(+0.04%)    -78.05 μs(-4.68%)   -57.34 μs(-4.86%)   -262.14 μs(-8.29%)   +1.28 ms(+16.32%)   -2.62 ms(-19.90%)   -4.85 ms(-28.68%)
spike/get_zipcode          fetchIndex    -1650(-0.40%)   -147.70 μs(-7.55%)   -81.92 μs(-6.41%)  -425.98 μs(-10.83%)   -65.54 μs(-0.60%)   -3.15 ms(-19.51%)   -5.64 ms(-26.71%)
spike/get_zipcode_by_city  fetchDetails  -1259(-0.31%)   -170.61 μs(-2.37%)  -131.07 μs(-2.02%)   -327.68 μs(-3.47%)  -851.97 μs(-5.18%)   -3.67 ms(-16.28%)    -2.36 ms(-7.38%)

However, in situations under heavy stress, space/time might work slightly better:

Benchmark details: Plaintext, unbounded throughput workload, with Quarkus assigned NUMA CPUs at ~95% usage. Invoked with wrk -t 3 -c 24 -H 'accept: text/plain' -d 180s http://x.x.x.x:8080/hello

[Screenshot (2022-09-19): benchmark results for the plaintext unbounded-throughput comparison of space/time vs adaptive]

The important detail above is the request count. Space/time showed 2.3% better throughput than adaptive at this high CPU usage. At these levels the latencies cannot be fully trusted, but this gives us guidance that under stress adaptive does not perform dramatically worse.

One more difference between space/time and adaptive is the survivor space area. Space/time hardcodes 0 survivor spaces, which means that as soon as an object survives a single young collection it gets promoted to the old generation. This early promotion can favor GC nepotism and, combined with the reduced frequency of full GCs, can lead to big spikes in memory consumption that only come down when a full GC happens. Adaptive, on the other hand, has full survivor space support, so short-lived objects have a bigger opportunity to be garbage collected. On top of that, adaptive tracks the pause time of consecutive minor collections and, if a threshold is passed, schedules a full GC regardless. This more aggressive behavior can mitigate potential GC nepotism issues.
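
As a minimal sketch of the premature-promotion pattern described above (illustrative Java only, not GraalVM internals):

    // With no survivor spaces, `parent` is promoted to the old generation after
    // surviving a single young collection. During later young collections the old
    // generation is treated as a source of roots via remembered sets, so even if
    // `parent` has since become garbage it keeps `parent.next` reachable until a
    // full GC runs ("GC nepotism").
    class Node {
        Node next;
        byte[] payload = new byte[1024];
    }

    Node parent = new Node();   // survives one young GC -> promoted immediately
    parent.next = new Node();   // young object kept alive via the promoted parent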

A few years ago it was common to see developers setting -Xmx and -Xms to the same value when running JVM applications, to increase the predictability of the JVM. Things have moved on since then in the JVM world, but I ran some tests to see what impact this would have on native executables. With space/time, applying these settings delayed full GCs longer compared to just passing in -Xmx, which led to more memory consumption but no performance gains. With adaptive, these settings have no impact on either memory consumption or performance.
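
For reference, those runs simply started the native executable with matching heap bounds, along these lines (the runner name is illustrative):

    ./myapp-runner -Xmx512m -Xms512m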

Alternatives

  • Do nothing: keep using the space/time GC policy and emit a warning when -Xmx is not passed (feasibility needs to be checked). I don’t think the out-of-the-box experience is as good as adaptive’s (see reasons above), and in my experiments space/time didn’t appear to be clearly better than adaptive. Also, the lack of survivor spaces in space/time means short-lived objects will easily build up in the old generation until a full GC.
  • Space/time hardcodes survivor spaces to 0. I tried changing that hardcoded value to something higher, but that alone is not enough for space/time to start using survivor spaces; it requires more plumbing.
  • I tried using bigger young space sizes with space/time, to see whether this would enable more objects to be garbage collected during a young-generation collection and hence reduce the number of objects promoted to the old generation, but this didn’t improve things and could even make them worse: when you give ~50% of the max heap to the young generation, all collections become full GCs, resulting in performance degradation and increased memory consumption (see the sketch after this list).
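
One way to try the larger-young-generation experiment from the last bullet is the runtime young generation size option accepted by native executables. A hedged example (runner name and sizes are illustrative of the ~50%-of-max-heap case; this may not match exactly how the original experiment was configured):

    ./myapp-runner -Xmx512m -Xmn256m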

Risks

I’ve only run a small number of workloads to compare adaptive vs space/time, so there is a chance we discover workloads where space/time is a clear winner. In those cases we can fall back to the space/time GC policy, but we could also explore exposing some of the adaptive GC policy’s main configuration options via the command line. For example, increasing adaptive’s GC_TIME_RATIO setting would result in fewer full GC events, which is one of space/time’s main features.

Speaking with Peter Hofer (Oracle), he indicated that they want to switch the serial GC’s old generation to mark-sweep-compact rather than copying, but he didn’t indicate when that would happen. This doesn’t seem to be something that will happen in the near future, but if it does, we will need to reevaluate the cost of full GCs and decide whether to go back to space/time or make adaptive generate fewer full GC events.

Staying with space/time is also a risk. Setting it up requires passing in a -H: option, which is not a public API. Paul Woegerer (Oracle) has been wondering what the reason for using space/time is (oracle/graal#4862 (comment)) and I provided an answer (oracle/graal#4862 (comment)). If there’s a clear indication that space/time is beneficial, we should talk with the GraalVM team about having a higher-level option for setting it up.

Sanne commented Sep 29, 2022

👍 let's switch to the default. @galderz will you be sending a PR?

galderz commented Sep 29, 2022

@Sanne Yes I will, in the next day or so.
