Lazily allocate TypedArena's first chunk #36592

Merged: 1 commit merged into rust-lang:master on Sep 22, 2016

Conversation

nnethercote
Contributor

Currently TypedArena allocates its first chunk, which is usually 4096
bytes, as soon as it is created. If no allocations are ever made from
the arena then this allocation (and the corresponding deallocation) is
wasted effort.

This commit changes TypedArena so it doesn't allocate the first chunk
until the first allocation is made.

This change speeds up rustc by a non-trivial amount because rustc uses
TypedArena heavily: compilation speed (producing debug builds) on
several of the rustc-benchmarks increases by 1.02--1.06x. The change
should never cause a slow-down because the hot alloc function is
unchanged. It does increase the size of TypedArena by one usize
field, however.

The commit also fixes some out-of-date comments.
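
For illustration, here is a toy sketch of the lazy-allocation idea (stand-in types only, not the actual libarena code, which manages raw chunk memory):

// Toy sketch: `new` allocates nothing; the grow path creates the first chunk
// only when the first allocation needs it.
struct LazyArena<T> {
    chunks: Vec<Vec<T>>, // stand-in for the real chunk list
}

impl<T> LazyArena<T> {
    fn new() -> LazyArena<T> {
        LazyArena { chunks: Vec::new() } // no chunk is allocated here any more
    }

    fn grow(&mut self) {
        // Double the previous chunk's capacity, or fall back to a small
        // placeholder size for the very first chunk (the real first chunk is
        // usually 4096 bytes, as described above).
        let new_capacity = match self.chunks.last() {
            Some(last) => last.capacity().checked_mul(2).unwrap(),
            None => 8, // placeholder first-chunk capacity
        };
        self.chunks.push(Vec::with_capacity(new_capacity));
    }
}

With this shape, constructing the arena performs no heap allocation until grow is first reached from the allocation path.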

@rust-highfive
Collaborator

r? @alexcrichton

(rust_highfive has picked a reviewer for you, use r? to override)

@nnethercote
Contributor Author

Some more details. For hyper.0.5.0 this reduces cumulative heap allocations
like so:

Before: 12,326,769,350 bytes in 13,849,772 blocks
After:   5,264,559,086 bytes in 10,361,847 blocks

(These measurements are from Valgrind's DHAT tool, which I used to identify
this problem.)

This is due to rustc's frequent use of CtxtArenas, which contains 10(!)
TypedArenas that are rarely used. When unused, the arena memory isn't
touched, but the cost of many malloc/free pairs is non-trivial. Here are the
speedups for the larger rustc-benchmarks on my Linux box.

opt stage1 rustc (w/glibc malloc) producing debug builds:
- hyper.0.5.0                          6.167s vs  5.927s --> 1.040x faster
- html5ever-2016-08-25                 8.511s vs  8.296s --> 1.026x faster
- regex.0.1.30                         2.970s vs  2.797s --> 1.062x faster
- piston-image-0.10.3                 13.848s vs 13.224s --> 1.047x faster
- rust-encoding-0.3.0                  3.654s vs  3.558s --> 1.027x faster

opt stage2 rustc (w/jemalloc) producing debug builds:
- hyper.0.5.0                          5.271s vs  5.188s --> 1.016x faster
- html5ever-2016-08-25                 6.957s vs  6.775s --> 1.027x faster
- regex.0.1.30                         2.518s vs  2.448s --> 1.029x faster
- piston-image-0.10.3                 11.689s vs 11.444s --> 1.021x faster
- rust-encoding-0.3.0                  3.276s vs  3.268s --> 1.002x faster

The stage2 improvements are smaller, presumably because jemalloc is faster at
doing unnecessary malloc/free operations.

@Mark-Simulacrum Mark-Simulacrum (Member) left a comment

Not an official Rust reviewer, but some general thoughts I had when looking through the code.

let prev_capacity = chunks.last().unwrap().storage.cap();
let new_capacity = prev_capacity.checked_mul(2).unwrap();
if chunks.last_mut().unwrap().storage.double_in_place() {
if chunks.len() == 0 {
Member
Prefer Vec::is_empty()

for mut chunk in chunks_borrow.drain(..last_idx) {
let cap = chunk.storage.cap();
chunk.destroy(cap);
if chunks_borrow.len() > 0 {
Member
Prefer !chunks_borrow.is_empty().

chunk.destroy(cap);
if chunks_borrow.len() > 0 {
let last_idx = chunks_borrow.len() - 1;
self.clear_last_chunk(&mut chunks_borrow[last_idx]);
Member
Why not chunks_borrow.last_mut()? It might conflict with the drain below, in which case you can use split_at_mut I think.
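
For illustration, the split_at_mut variant might look roughly like this toy sketch (stand-in types, not the arena code):

// split_at_mut separates the last chunk from the earlier ones, so both can be
// mutated without overlapping mutable borrows of the Vec.
fn clear_chunks(chunks_borrow: &mut Vec<Vec<u8>>) {
    if let Some(last_idx) = chunks_borrow.len().checked_sub(1) {
        let (earlier, last) = chunks_borrow.split_at_mut(last_idx);
        last[0].clear(); // stand-in for clear_last_chunk
        for chunk in earlier {
            chunk.clear(); // stand-in for chunk.destroy(cap)
        }
        // The earlier chunks would then still need to be removed from the Vec,
        // e.g. with drain(..last_idx), once the split borrows have ended.
    }
}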

for chunk in chunks_borrow.iter_mut() {
let cap = chunk.storage.cap();
chunk.destroy(cap);
if chunks_borrow.len() > 0 {
Member
Prefer is_empty.

@Mark-Simulacrum
Member

The code (both preexisting and current) also duplicates the "pop the last element, then drain/mutably iterate over and destroy the rest of the chunks" logic. Can it be extracted into a helper function?

I've discussed this with @nnethercote; I think they believe this would be best done as a follow-up PR. Leaving this comment here so the idea doesn't get lost.
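
A hypothetical shape for such a helper, with toy stand-in types (the name and signature are assumptions, not actual libarena code):

// Pops the last chunk (if any) so the caller can keep or clear it, and
// destroys every chunk that came before it.
fn take_last_and_destroy_rest(chunks: &mut Vec<Vec<u8>>) -> Option<Vec<u8>> {
    let last = chunks.pop();
    for chunk in chunks.drain(..) {
        drop(chunk); // stand-in for chunk.destroy(cap)
    }
    last
}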

@nnethercote
Contributor Author

I replaced the len() > 0 expressions in the original version with is_empty().

@bluss
Member

bluss commented Sep 20, 2016

In the libcollections convention, with_capacity is for explicit up-front allocation of capacity, while new is welcome to not allocate anything. Since the compiler exclusively uses new, from what I can see, is it not best to follow the convention here?

Alternatively, with_capacity is already a bit outside the convention, since it's really with_chunk_size or with_chunk_capacity or something like that, so it could be renamed.

Either way, doc comments need updates to not say "preallocated" for new and with_capacity.

@bluss
Member

bluss commented Sep 20, 2016

The is_empty tests seem to be inverted

@nnethercote
Contributor Author

Thank you for the comments, @bluss. I fixed the inverted is_empty tests and updated the comments for new and with_capacity.

@TimNN
Contributor

TimNN commented Sep 20, 2016

I did a quick grep over the rust source code and I don't think TypedArena::with_capacity is ever used, so it may be possible to just remove it entirely.

let prev_capacity = chunks.last().unwrap().storage.cap();
let new_capacity = prev_capacity.checked_mul(2).unwrap();
if chunks.last_mut().unwrap().storage.double_in_place() {
if chunks.is_empty() {
Member
I too would like to rewrite this to use the .last_mut() Option for control flow (None is the empty case), but it needs some wrangling to be able to call chunks.push at the end.
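
Roughly the shape being suggested, as a toy sketch (stand-in types; the real grow path also tries storage.double_in_place() on the last chunk first):

fn grow(chunks: &mut Vec<Vec<u8>>) {
    // last_mut() is Some for the existing-chunk case and None when no chunk has
    // been allocated yet; computing the capacity inside the match lets the
    // mutable borrow end before the push below.
    let new_capacity = match chunks.last_mut() {
        Some(last) => last.capacity().checked_mul(2).unwrap(),
        None => 8, // placeholder first-chunk capacity
    };
    chunks.push(Vec::with_capacity(new_capacity));
}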

let cap = chunk.storage.cap();
chunk.destroy(cap);
if !chunks_borrow.is_empty() {
let mut last_chunk = chunks_borrow.pop().unwrap();
Member
pop's Option can be used for control flow

for mut chunk in chunks_borrow.drain(..last_idx) {
let cap = chunk.storage.cap();
chunk.destroy(cap);
if !chunks_borrow.is_empty() {
Member
We could use .pop() here too, drain all other chunks, then put the last chunk back (seems like the simplest way to keep the borrow checker happy).
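
As a toy sketch of that "pop, drain, push back" shape (stand-in types, not the arena code):

fn clear_all_but_last(chunks_borrow: &mut Vec<Vec<u8>>) {
    if let Some(mut last_chunk) = chunks_borrow.pop() {
        last_chunk.clear(); // reset the chunk we keep (stand-in for clear_last_chunk)
        for chunk in chunks_borrow.drain(..) {
            drop(chunk); // stand-in for chunk.destroy(cap)
        }
        chunks_borrow.push(last_chunk); // put the retained chunk back
    }
}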

Member
This is not more work than what drain already does.

@bluss bluss (Member) left a comment

Using the Options for control flow will end up with prettier Rust code

@bluss
Member

bluss commented Sep 20, 2016

(I haven't ever used the review feature before. I haven't heard any news on how we want to use it in the project.)

@nnethercote
Contributor Author

I made the requested control flow changes. I haven't changed with_capacity, though I'm happy to remove it if there is consensus there.

@nnethercote
Contributor Author

Note that clear and drop are now very similar, although one uses drain and the other iter_mut. I don't know if that similarity can be factored out.

@bluss
Member

bluss commented Sep 20, 2016

@bors r+

@bors
Contributor

bors commented Sep 20, 2016

📌 Commit 80a4477 has been approved by bluss

@brson brson added the relnotes label (marks issues that should be documented in the release notes of the next release) on Sep 20, 2016
@arielb1
Contributor

arielb1 commented Sep 20, 2016

Nice catch @nnethercote! The redundant arenas used to not matter because we had 1 CtxtArenas struct per compiler run, but we missed the overhead when we moved to 1 arena/function.

@nnethercote
Contributor Author

Now that I have a better idea of how rustc-benchmarks works, here are some
updated numbers. This is with a rustc configured with '--enable-optimize
--enable-debuginfo', producing debug builds.

stage 1 (uses glibc malloc)

futures-rs-test-all                  4.925s vs  4.755s --> 1.036x faster
helloworld                           0.220s vs  0.221s --> 0.995x faster
html5ever-2016-08-25                23.086s vs 22.216s --> 1.039x faster
hyper.0.5.0                         21.441s vs 20.491s --> 1.046x faster
inflate-0.1.0                        5.083s vs  4.860s --> 1.046x faster
issue-32062-equality-relations-c...  0.397s vs  0.396s --> 1.003x faster
issue-32278-big-array-of-strings     1.839s vs  1.837s --> 1.001x faster
jld-day15-parser                     5.805s vs  5.656s --> 1.026x faster
piston-image-0.10.3                 28.530s vs 27.061s --> 1.054x faster
regex.0.1.30                         2.975s vs  2.798s --> 1.063x faster
rust-encoding-0.3.0                  3.571s vs  3.537s --> 1.010x faster
syntex-0.42.2                       52.195s vs 49.760s --> 1.049x faster
syntex-0.42.2-incr-clean            52.023s vs 49.806s --> 1.045x faster

stage2 (uses jemalloc)

futures-rs-test-all                  4.283s vs  4.188s --> 1.023x faster
helloworld                           0.222s vs  0.221s --> 1.005x faster
html5ever-2016-08-25                17.508s vs 17.154s --> 1.021x faster
hyper.0.5.0                         17.506s vs 17.164s --> 1.020x faster
inflate-0.1.0                        4.410s vs  4.380s --> 1.007x faster
issue-32062-equality-relations-c...  0.366s vs  0.362s --> 1.011x faster
issue-32278-big-array-of-strings     1.636s vs  1.650s --> 0.992x faster
jld-day15-parser                     4.698s vs  4.646s --> 1.011x faster
piston-image-0.10.3                 23.283s vs 22.819s --> 1.020x faster
regex.0.1.30                         2.527s vs  2.460s --> 1.027x faster
rust-encoding-0.3.0                  3.279s vs  3.315s --> 0.989x faster
syntex-0.42.2                       42.986s vs 42.215s --> 1.018x faster
syntex-0.42.2-incr-clean            43.079s vs 42.134s --> 1.022x faster

With glibc malloc they're mostly in the range 1.03--1.06x faster. With jemalloc
they're mostly in the range 1.01--1.03x faster. The couple that look slower are
just due to measurement noise.

sophiajt pushed a commit to sophiajt/rust that referenced this pull request Sep 21, 2016
Lazily allocate TypedArena's first chunk

sophiajt pushed a commit to sophiajt/rust that referenced this pull request Sep 21, 2016
Lazily allocate TypedArena's first chunk

@eddyb
Member

eddyb commented Sep 21, 2016

@nnethercote FWIW, I've been meaning to eventually move to a single common drop-less arena (instead of a dozen typed ones), but there were things to rework first to make that even possible. We're almost there: in fact, Ty only has TraitObject left that's not POD (I think I want that to be a slice of existential predicates), and everything else has at most a Vec somewhere (which can become an arena slice).

_own: PhantomData,
}
TypedArena {
first_chunk_capacity: cmp::max(1, capacity),
Member
If with_capacity isn't used, I think it'd be worth just not having first_chunk_capacity around at all.

Contributor Author
Good suggestion. I'll file a follow-up PR to remove with_capacity once this one lands.

Member
What I'm saying is that this PR would be simpler if it also made that change. I'd r+ it immediately, and this PR will have to wait at least half a day more before getting merged, so you have time now.

@bors
Contributor

bors commented Sep 22, 2016

⌛ Testing commit 80a4477 with merge b2627b0...

bors added a commit that referenced this pull request Sep 22, 2016
Lazily allocate TypedArena's first chunk

@bors bors merged commit 80a4477 into rust-lang:master Sep 22, 2016
bors added a commit that referenced this pull request Sep 24, 2016
[breaking-change] Remove TypedArena::with_capacity

This is a follow-up to #36592.

The function is unused by rustc. Also, it doesn't really follow the
usual meaning of a `with_capacity` function because the first chunk
allocation is now delayed until the first `alloc` call.

This change reduces the size of `TypedArena` by one `usize`.

@eddyb: we discussed this on IRC. Would you like to review it?
@nnethercote nnethercote deleted the TypedArena branch October 7, 2016 05:02