cg_llvm: use index-based loop in write_operand_repeatedly #112516

Merged
merged 1 commit into from
Jun 27, 2023

Conversation

erikdesjardins
Contributor

@erikdesjardins erikdesjardins commented Jun 11, 2023

This should be easier for LLVM to analyze.
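As a rough illustration of the two loop shapes (plain Rust, not the codegen itself): the first function below has a pointer as its induction variable and exits on a pointer comparison, the second has an integer index and exits on an integer comparison. The names `fill_by_pointer` and `fill_by_index` are hypothetical and do not appear in rustc.

```rust
/// Pointer-based shape: the loop variable is a pointer, and the exit test
/// compares it against a one-past-the-end pointer.
fn fill_by_pointer(dest: &mut [u64], value: u64) {
    let start = dest.as_mut_ptr();
    // SAFETY: `start + len` is one past the end of the slice, which is allowed.
    let end = unsafe { start.add(dest.len()) };
    let mut current = start;
    while current != end {
        // SAFETY: `current` is in bounds because it has not reached `end`.
        unsafe {
            current.write(value);
            current = current.add(1);
        }
    }
}

/// Index-based shape: the loop variable is an integer compared against the
/// element count, and the address is derived from the index on each iteration.
fn fill_by_index(dest: &mut [u64], value: u64) {
    let count = dest.len();
    let mut i = 0;
    while i != count {
        dest[i] = value;
        i += 1;
    }
}

fn main() {
    let mut a = [0u64; 1000];
    let mut b = [0u64; 1000];
    fill_by_pointer(&mut a, 7);
    fill_by_index(&mut b, 7);
    // Both shapes write the same values; only the loop structure differs.
    assert_eq!(a, b);
}
```

Both functions have the same observable behavior. The difference that matters here is that the index-based form has an integer induction variable with an obvious trip count.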

Fixes #111603

This needs a perf run.

cc @caojoshua

@rustbot
Collaborator

rustbot commented Jun 11, 2023

r? @davidtwco

(rustbot has picked a reviewer for you, use r? to override)

@rustbot rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. labels Jun 11, 2023
let dest_elem = dest.project_index(&mut body_bx, i);
cg_elem.val.store(&mut body_bx, dest_elem);

let next = body_bx.unchecked_uadd(i, self.const_usize(1));

Should we add the no-unsigned-wrap (nuw) flag here, or do we let LLVM infer it instead?

Contributor Author

@erikdesjardins erikdesjardins Jun 11, 2023

unchecked_uadd (vs. add) adds nuw
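For readers unfamiliar with the flag: `nuw` tells LLVM that the add cannot wrap in unsigned arithmetic. Below is a plain-Rust sketch of why that promise holds for this loop counter, assuming the increment only runs while `i` is below the element count. `next_index` is a hypothetical helper, not a rustc API.

```rust
/// Hypothetical helper: computes the next loop index.
/// `checked_add` returns `None` exactly when the unsigned add would wrap,
/// which is the case `nuw` promises can never happen.
fn next_index(i: usize, count: usize) -> Option<usize> {
    assert!(i < count, "the increment only runs inside the loop body");
    i.checked_add(1)
}

fn main() {
    let count = 1000;
    let mut i = 0;
    while i != count {
        // Since i < count <= usize::MAX, i + 1 always fits in a usize.
        i = next_index(i, count).expect("i + 1 cannot overflow while i < count");
    }
    assert_eq!(i, count);
    // Even at the largest possible count, the last increment stays in range.
    assert_eq!(next_index(usize::MAX - 1, usize::MAX), Some(usize::MAX));
}
```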

current,
&[self.const_usize(1)],
);
let dest_elem = dest.project_index(&mut body_bx, i);

Does project_index emit a GEP? Will it emit the inbounds and align info that was previously there?

Contributor Author

@erikdesjardins erikdesjardins Jun 11, 2023

Yes:

pub fn project_index<Bx: BuilderMethods<'a, 'tcx, Value = V>>(
    &self,
    bx: &mut Bx,
    llindex: V,
) -> Self {
    // Statically compute the offset if we can, otherwise just use the element size,
    // as this will yield the lowest alignment.
    let layout = self.layout.field(bx, 0);
    let offset = if let Some(llindex) = bx.const_to_opt_uint(llindex) {
        layout.size.checked_mul(llindex, bx).unwrap_or(layout.size)
    } else {
        layout.size
    };
    PlaceRef {
        llval: bx.inbounds_gep(
            bx.cx().backend_type(self.layout),
            self.llval,
            &[bx.cx().const_usize(0), llindex],
        ),
        llextra: None,
        layout,
        align: self.align.restrict_for_offset(offset),
    }
}

What was there before was basically a reimplementation of project_index (which was sort of necessary because the GEP was threaded through the phi).
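The alignment comment in project_index can be modeled in a few lines. This is a simplified sketch using plain integers rather than rustc's `Align` and `Size` types, and the free functions `max_align_for_offset` and `restrict_for_offset` are stand-ins written for illustration. The idea is that a pointer at `base + offset` is only guaranteed to be aligned to the largest power of two dividing `offset`, capped by the base alignment.

```rust
/// Largest power-of-two alignment guaranteed at `offset` bytes from an
/// arbitrarily well-aligned base. Offset 0 imposes no restriction (`None`).
fn max_align_for_offset(offset: u64) -> Option<u64> {
    if offset == 0 {
        None
    } else {
        Some(1u64 << offset.trailing_zeros())
    }
}

/// Alignment still guaranteed after moving `offset` bytes from a pointer
/// that is aligned to `align`.
fn restrict_for_offset(align: u64, offset: u64) -> u64 {
    match max_align_for_offset(offset) {
        Some(max) => align.min(max),
        None => align,
    }
}

fn main() {
    // 8-byte-aligned array of 8-byte elements: every element stays 8-aligned.
    assert_eq!(restrict_for_offset(8, 8), 8);
    // 16-byte-aligned base, element size 8, index unknown: falling back to the
    // element size as the offset gives 8, the lowest alignment any index yields.
    assert_eq!(restrict_for_offset(16, 8), 8);
    // Index known to be 2: offset 16 keeps the full 16-byte alignment.
    assert_eq!(restrict_for_offset(16, 16), 16);
    // Offset 0 leaves the alignment unchanged.
    assert_eq!(restrict_for_offset(16, 0), 16);
}
```

With a non-constant loop index, as in the repeat loop discussed here, only the fallback case applies, so the stores are aligned to at most the element size.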

@@ -5,6 +5,18 @@

use std::sync::Arc;

// CHECK-LABEL: @new_from_array

Could we emit the entire IR here? I think it's helpful for reviewers, test coverage, and people trying to understand the codebase better.

Contributor Author

@erikdesjardins erikdesjardins Jun 11, 2023

Because we don't have something like LLVM's update_test_checks script, it is painful to update codegen tests. So we tend to use the minimal number of CHECK lines, so that we don't have to update tests due to unrelated changes. (Also, most tests don't specify a target, so they generate slightly different IR per platform due to differences in ABI, vectorization, etc., which would have to be dealt with if we did this in general.)

In this case, this is the full IR of new_from_array:

IR
define { ptr, i64 } @new_from_array(i64 noundef %x) unnamed_addr #0 personality ptr @rust_eh_personality {
start:
  %array = alloca [1000 x i64], align 8
  %broadcast.splatinsert = insertelement <2 x i64> poison, i64 %x, i64 0
  %broadcast.splat = shufflevector <2 x i64> %broadcast.splatinsert, <2 x i64> poison, <2 x i32> zeroinitializer
  %broadcast.splatinsert1 = insertelement <2 x i64> poison, i64 %x, i64 0
  %broadcast.splat2 = shufflevector <2 x i64> %broadcast.splatinsert1, <2 x i64> poison, <2 x i32> zeroinitializer
  br label %vector.body

vector.body:                                      ; preds = %vector.body, %start
  %index = phi i64 [ 0, %start ], [ %index.next.4, %vector.body ]
  %0 = getelementptr inbounds [1000 x i64], ptr %array, i64 0, i64 %index
  store <2 x i64> %broadcast.splat, ptr %0, align 8
  %1 = getelementptr inbounds i64, ptr %0, i64 2
  store <2 x i64> %broadcast.splat2, ptr %1, align 8
  %index.next = add nuw nsw i64 %index, 4
  %2 = getelementptr inbounds [1000 x i64], ptr %array, i64 0, i64 %index.next
  store <2 x i64> %broadcast.splat, ptr %2, align 8
  %3 = getelementptr inbounds i64, ptr %2, i64 2
  store <2 x i64> %broadcast.splat2, ptr %3, align 8
  %index.next.1 = add nuw nsw i64 %index, 8
  %4 = getelementptr inbounds [1000 x i64], ptr %array, i64 0, i64 %index.next.1
  store <2 x i64> %broadcast.splat, ptr %4, align 8
  %5 = getelementptr inbounds i64, ptr %4, i64 2
  store <2 x i64> %broadcast.splat2, ptr %5, align 8
  %index.next.2 = add nuw nsw i64 %index, 12
  %6 = getelementptr inbounds [1000 x i64], ptr %array, i64 0, i64 %index.next.2
  store <2 x i64> %broadcast.splat, ptr %6, align 8
  %7 = getelementptr inbounds i64, ptr %6, i64 2
  store <2 x i64> %broadcast.splat2, ptr %7, align 8
  %index.next.3 = add nuw nsw i64 %index, 16
  %8 = getelementptr inbounds [1000 x i64], ptr %array, i64 0, i64 %index.next.3
  store <2 x i64> %broadcast.splat, ptr %8, align 8
  %9 = getelementptr inbounds i64, ptr %8, i64 2
  store <2 x i64> %broadcast.splat2, ptr %9, align 8
  %index.next.4 = add nuw nsw i64 %index, 20
  %10 = icmp eq i64 %index.next.4, 1000
  br i1 %10, label %repeat_loop_next, label %vector.body, !llvm.loop !2

repeat_loop_next:                                 ; preds = %vector.body
  %11 = load volatile i8, ptr @__rust_no_alloc_shim_is_unstable, align 1, !noalias !5
  %12 = tail call noundef align 8 dereferenceable_or_null(8016) ptr @__rust_alloc(i64 noundef 8016, i64 noundef 8) #6, !noalias !5
  %13 = icmp eq ptr %12, null
  br i1 %13, label %bb1.i.i, label %"_ZN5alloc4sync12Arc$LT$T$GT$3new17hc22c917a7edefd8bE.exit"

bb1.i.i:                                          ; preds = %repeat_loop_next
; call alloc::alloc::handle_alloc_error
  tail call void @_ZN5alloc5alloc18handle_alloc_error17h5a822ff2e844764dE(i64 noundef 8, i64 noundef 8016) #7, !noalias !5
  unreachable

"_ZN5alloc4sync12Arc$LT$T$GT$3new17hc22c917a7edefd8bE.exit": ; preds = %repeat_loop_next
  store i64 1, ptr %12, align 8, !noalias !5
  %x.sroa.4.0._14.sroa_idx.i = getelementptr inbounds i8, ptr %12, i64 8
  store i64 1, ptr %x.sroa.4.0._14.sroa_idx.i, align 8, !noalias !5
  %x.sroa.5.0._14.sroa_idx.i = getelementptr inbounds i8, ptr %12, i64 16
  call void @llvm.memcpy.p0.p0.i64(ptr noundef nonnull align 8 dereferenceable(8000) %x.sroa.5.0._14.sroa_idx.i, ptr noundef nonnull align 8 dereferenceable(8000) %array, i64 8000, i1 false)
  %14 = insertvalue { ptr, i64 } poison, ptr %12, 0
  %15 = insertvalue { ptr, i64 } %14, i64 1000, 1
  ret { ptr, i64 } %15
}
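The Rust source of the test is not shown in this thread. A function of the following shape would be consistent with the IR above (a `{ ptr, i64 }` return with length 1000, a `[1000 x i64]` alloca, and a call into `Arc::new`), but the parameter type and body here are assumptions rather than the actual test file.

```rust
use std::sync::Arc;

// Assumed shape of the function under test; only its IR appears in the thread.
// A real codegen test would also need an unmangled symbol name so that
// `CHECK-LABEL: @new_from_array` can match it.
pub fn new_from_array(x: u64) -> Arc<[u64]> {
    // `[x; 1000]` is a repeat expression, the construct that
    // write_operand_repeatedly lowers to a store loop.
    let array = [x; 1000];
    // The sized array is moved into the Arc, then unsized to a slice on return.
    Arc::new(array)
}

fn main() {
    let arc = new_from_array(42);
    assert_eq!(arc.len(), 1000);
    assert!(arc.iter().all(|&v| v == 42));
}
```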

Thanks for sharing the IR. Looks like the redundant alloca was removed. Nice!

@caojoshua

I can confirm that this approach is what I was suggesting in the issue. I have been working on various changes within LLVM to improve analysis on pointer comparisons that would have resolved this issue. However, I still think it makes sense to make this change in rustc.

I have not worked on this project and do not want to dive deep in the code base right now. Please excuse my noob questions.


@caojoshua caojoshua left a comment

LGTM, but of course there should be other reviewers

@the8472
Member

the8472 commented Jun 11, 2023

@bors try @rust-timer queue

@rustbot rustbot added the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Jun 11, 2023
@bors
Contributor

bors commented Jun 11, 2023

⌛ Trying commit bd0aae9 with merge c4c156d920b0e876d380a9464958a5e90d2d1d48...

@bors
Contributor

bors commented Jun 11, 2023

☀️ Try build successful - checks-actions
Build commit: c4c156d920b0e876d380a9464958a5e90d2d1d48 (c4c156d920b0e876d380a9464958a5e90d2d1d48)

@rust-timer
Collaborator

Finished benchmarking commit (c4c156d920b0e876d380a9464958a5e90d2d1d48): comparison URL.

Overall result: ❌ regressions - no action needed

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf.

@bors rollup=never
@rustbot label: -S-waiting-on-perf -perf-regression

Instruction count

This is a highly reliable metric that was used to determine the overall result at the top of this comment.

| | mean | range | count |
|---|---|---|---|
| Regressions ❌ (primary) | - | - | 0 |
| Regressions ❌ (secondary) | 0.5% | [0.3%, 0.5%] | 4 |
| Improvements ✅ (primary) | - | - | 0 |
| Improvements ✅ (secondary) | - | - | 0 |
| All ❌✅ (primary) | - | - | 0 |

Max RSS (memory usage)

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

| | mean | range | count |
|---|---|---|---|
| Regressions ❌ (primary) | 3.2% | [3.2%, 3.2%] | 1 |
| Regressions ❌ (secondary) | - | - | 0 |
| Improvements ✅ (primary) | - | - | 0 |
| Improvements ✅ (secondary) | - | - | 0 |
| All ❌✅ (primary) | 3.2% | [3.2%, 3.2%] | 1 |

Cycles

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

| | mean | range | count |
|---|---|---|---|
| Regressions ❌ (primary) | - | - | 0 |
| Regressions ❌ (secondary) | 2.1% | [1.8%, 2.3%] | 9 |
| Improvements ✅ (primary) | - | - | 0 |
| Improvements ✅ (secondary) | - | - | 0 |
| All ❌✅ (primary) | - | - | 0 |

Binary size

This benchmark run did not return any relevant results for this metric.

Bootstrap: 648.85s -> 647.756s (-0.17%)

@rustbot rustbot removed the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Jun 11, 2023
@erikdesjardins
Contributor Author

Judging from a local cachegrind diff, the regression looks like inlining noise:

--------------------------------------------------------------------------------
Ir       file:function
--------------------------------------------------------------------------------
 58,613  ???:rustc_middle::ty::codec::encode_with_shorthand::<rustc_middle::query::on_disk_cache::CacheEncoder, rustc_middle::ty::Ty, <rustc_middle::query::on_disk_cache::CacheEncoder as rustc_type_ir::codec::TyEncoder>::type_shorthands>
-43,266  ???:<[rustc_middle::mir::LocalDecl] as rustc_serialize::serialize::Encodable<rustc_middle::query::on_disk_cache::CacheEncoder>>::encode
 23,234  ???:<rustc_metadata::creader::CStore as rustc_session::cstore::CrateStore>::def_path_hash
-18,325  ???:<rustc_span::def_id::DefId as rustc_data_structures::stable_hasher::HashStable<rustc_query_system::ich::hcx::StableHashingContext>>::hash_stable
-13,446  ./elf/dl-lookup.c:_dl_lookup_symbol_x
-12,600  ???:<alloc::vec::Vec<rustc_middle::ty::adjustment::Adjustment> as rustc_serialize::serialize::Decodable<rustc_middle::query::on_disk_cache::CacheDecoder>>::decode
 11,456  ???:<std::collections::hash::map::HashMap<rustc_hir::hir_id::ItemLocalId, alloc::vec::Vec<rustc_middle::ty::adjustment::Adjustment>, core::hash::BuildHasherDefault<rustc_hash::FxHasher>> as rustc_serialize::serialize::Decodable<rustc_middle::query::on_disk_cache::CacheDecoder>>::decode
-11,256  ???:<&mut <rustc_middle::mir::syntax::Place as rustc_serialize::serialize::Decodable<rustc_middle::query::on_disk_cache::CacheDecoder>>::decode::{closure
 11,256  ???:<rustc_middle::mir::syntax::ProjectionElem<rustc_middle::mir::Local, rustc_middle::ty::Ty> as rustc_serialize::serialize::Decodable<rustc_middle::query::on_disk_cache::CacheDecoder>>::decode
...

It seems like await-call-tree is noisy in general:

[image: screenshot, not preserved]

Member

@davidtwco davidtwco left a comment

LGTM

@davidtwco
Member

@bors r+

@bors
Contributor

bors commented Jun 27, 2023

📌 Commit bd0aae9 has been approved by davidtwco

It is now in the queue for this repository.

@bors bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Jun 27, 2023
@bors
Contributor

bors commented Jun 27, 2023

⌛ Testing commit bd0aae9 with merge 3c554f5...

@bors
Contributor

bors commented Jun 27, 2023

☀️ Test successful - checks-actions
Approved by: davidtwco
Pushing 3c554f5 to master...

@bors bors added the merged-by-bors This PR was explicitly merged by bors. label Jun 27, 2023
@bors bors merged commit 3c554f5 into rust-lang:master Jun 27, 2023
@rustbot rustbot added this to the 1.72.0 milestone Jun 27, 2023
@erikdesjardins erikdesjardins deleted the loop branch July 1, 2023 21:20
Labels
merged-by-bors This PR was explicitly merged by bors. S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.
Development

Successfully merging this pull request may close these issues.

Arc::new duplicates stack memory
7 participants