Extremely poor performance?? #2796

Closed
benoneal opened this issue Sep 9, 2021 · 27 comments
Labels
A-Rendering Drawing game state to the screen C-Bug An unexpected or incorrect behavior C-Performance A change motivated by improving speed, memory usage or compile times
Milestone

Comments

@benoneal

benoneal commented Sep 9, 2021

Bevy 0.5.0, Windows 10

What you did

  1. Ran bevymark and spawned around 2000 birds
  2. Ran alien cake addict with board size modified to 50x50 (from 14x21)
  3. Built alien cake addict with the release flag (to optimize the build) and re-ran it

What you expected to happen

Blisteringly high frame rates.

What actually happened

  1. A woeful 11 FPS in the benchmark
  2. Around 1 FPS in both the normal and release builds of the alien cake addict

Additional information

image
image

@benoneal benoneal added C-Bug An unexpected or incorrect behavior S-Needs-Triage This issue needs to be labelled labels Sep 9, 2021
@bjorn3
Contributor

bjorn3 commented Sep 9, 2021

We are currently rewriting the renderer. This will result in a huge increase in rendering performance. This is the pipelined-rendering branch. AFAIK it is already usable. Can you try benchmarking the bevymark_pipelined example on this branch instead?

@benoneal
Author

benoneal commented Sep 9, 2021

Sure. Just pulled that branch and ran that benchmark.

image

Seems around 5.5x faster, so a relatively decent improvement, but still a far cry from what I'd expect to see given what I've been able to get out of Unity's DOTS and even my own Javascript ECS experiments.

Is this normal? Is there something I can run to see a breakdown of timings per frame to isolate the bottleneck?

[Edited to add] * actually it may not be 5x faster... hard to tell, but the non-pipelined branch is rendering exactly 5 different color variants of the sprites, whereas the pipelined is only rendering 1. That may be a coincidence, but at the least it's an apples to oranges comparison right now.

@bjorn3
Contributor

bjorn3 commented Sep 9, 2021

Batching is not yet implemented. There is a PR for sprite batching: #2642
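For readers unfamiliar with the term, here is a rough sketch of what batching means in general. It is not the code from #2642 and not Bevy's API; the `Sprite` type and `build_batches` function are hypothetical names made up for this illustration. The idea is that sprites sharing a texture are grouped so each group can be submitted with one draw call instead of one call per sprite.

```rust
/// Hypothetical sprite record, reduced to what the illustration needs.
#[derive(Clone, Copy, Debug)]
pub struct Sprite {
    pub texture_id: u32,
    pub x: f32,
    pub y: f32,
}

/// Sorts sprites by texture and returns one `(texture_id, sprite_count)`
/// entry per batch. Each entry stands for a single draw call.
pub fn build_batches(sprites: &mut [Sprite]) -> Vec<(u32, usize)> {
    // Sorting puts sprites that share a texture next to each other.
    sprites.sort_by_key(|s| s.texture_id);

    let mut batches: Vec<(u32, usize)> = Vec::new();
    for s in sprites.iter() {
        // Extend the current batch while the texture stays the same.
        if let Some(last) = batches.last_mut() {
            if last.0 == s.texture_id {
                last.1 += 1;
                continue;
            }
        }
        // Otherwise start a new batch for this texture.
        batches.push((s.texture_id, 1));
    }
    batches
}
```

With five sprites spread over three textures, `build_batches` returns three entries, so three draw calls instead of five. A real renderer also has to respect draw order and transparency, which this sketch ignores.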

As for benchmarking, I know there is tracing infrastructure to get a breakdown per system, but I haven't used it myself, so I don't know exactly how to use it. I believe it is a feature flag and then a profile gets written to a file in the chrome profiler format.

@benoneal
Author

benoneal commented Sep 9, 2021

Hmm. It seems it isn't only a sprite issue. I just ran the spawner example:

image

Even 1 draw call per object here shouldn't result in 1.3 seconds per frame.

@cart
Member

cart commented Sep 9, 2021

Just as a heads up: that spawner example is still using the old renderer, even if you are on the pipelined-rendering branch.

@NiklasEi NiklasEi added C-Performance A change motivated by improving speed, memory usage or compile times S-Needs-Investigation This issue requires detective work to figure out what's going wrong and removed S-Needs-Triage This issue needs to be labelled labels Sep 9, 2021
@parasyte
Contributor

parasyte commented Sep 12, 2021

Another datapoint. I just tried bevymark out of curiosity on the main branch. On my machine, I get about 60 fps with just over 8,600 birbs.

I'm also testing on Windows 10 with a GeForce RTX 3090. The only major difference is my CPU: Ryzen 9 5900X.

birbs

Release profile with some extra optimizations in ~/.cargo/config:

[target.x86_64-pc-windows-msvc]
linker = "rust-lld.exe"
rustflags = ["-C", "target-cpu=native"]

Oh, and I'm using a nightly compiler:

$ rustc --version
rustc 1.57.0-nightly (8c2b6ea37 2021-09-11)

@cart
Member

cart commented Sep 12, 2021

@benoneal with your computer's specs you should be getting way higher performance, especially using the new renderer. Are you absolutely certain you are running bevymark_pipelined in release mode? My computer is older / weaker than yours and I'm getting ~67,000 sprites (and ~8,000 sprites in the old renderer). The performance you have been reporting across the board is in line with debug build performance, which makes me question the results a bit. What is the full command you are running on the command line?

@enbugger

enbugger commented Sep 13, 2021

Not bragging, gloating, or anything. Just for comparison, I'll post here the Unity DOTS-related video with timecode so it may be useful: https://youtu.be/ILfUuBLfzGI?t=1245
DOTS with Burst, batching, and some PC is able to move ~100k sprites with playable FPS.

UPD: the specs in the video are shown on task manager dashboard

The screenshot

image

@sarkahn
Contributor

sarkahn commented Sep 13, 2021

Not bragging, gloating, or anything. Just for comparison, I'll post here the Unity DOTS-related video with timecode so it may be useful: https://youtu.be/ILfUuBLfzGI?t=1245
DOTS with Burst, batching, and some PC is able to move ~100k sprites with playable FPS.

UPD: the specs in the video are shown on task manager dashboard

The screenshot

It's important to note that video is using the previous version of the dots "hybrid renderer". That version used gpu instancing, not batching. A proper implementation of a game using gpu instancing can easily render millions of sprites, even in Unity without dots, without breaking a sweat. However there's a lot of other issues that come along with gpu instancing. Namely compatibility with older hardware, and it doesn't support per instance overrides.

The current version of the hybrid renderer (v2) does use batching and does support per instance material overrides. My expectation is it would still greatly outperform bevy rendering right now as it's been worked on by a team of talented engineers for years at this point. So maybe not a fair comparison.

@cart
Member

cart commented Sep 13, 2021

@enbugger

Even without batching the new Bevy renderer currently matches the Unity DOTS performance from the youtube video on my machine (over 100k bunnies at ~41 fps):
image

My hardware is slightly better, but it's still in the same category (just about a year older):

  • CPU: Intel i7 7700k 4.2GHz
  • GPU: Nvidia GeForce GTX 1070
  • RAM: 16GB DDR4

With batching (which there is an open pr out for), I'm hitting 164k bunnies at ~41 fps
image

I promise the new renderer will have competitive performance.

A proper implementation of a game using gpu instancing can easily render millions of sprites, even in Unity without dots, without breaking a sweat. However there's a lot of other issues that come along with gpu instancing

@sarkahn
I'm gonna go ahead and doubt this until I see an equivalent benchmark (ex: user moves sprites using "game logic", gpu renders instanced sprites). You can have millions of instanced "particles" because they are driven by internal particle logic driven by a shader. If the user was hand-moving each particle, you would need to transfer that data to the gpu, which puts you back into the ballpark of "maybe over 100k if you do everything right". Or if sprites are "static" you don't need to copy data to the gpu every frame and then you can hit the "millions of sprites" mark.

@sarkahn
Contributor

sarkahn commented Sep 13, 2021

@sarkahn
I'm gonna go ahead and doubt this until I see an equivalent benchmark (ex: user moves sprites using "game logic", gpu renders instanced sprites). You can have millions of instanced "particles" because they are driven by internal particle logic driven by a shader. If the user was hand-moving each particle, you would need to transfer that data to the gpu, which puts you back into the ballpark of "maybe over 100k if you do everything right". Or if sprites are "static" you don't need to copy data to the gpu every frame and then you can hit the "millions of sprites" mark.

The "millions" I was referring to was from static sprites being animated or moved on the gpu. However, a couple of years ago when I was first learning dots, a bunch of other people and I were seeing how far we could push it with instancing. https://forum.unity.com/threads/200k-dynamic-animated-sprites-at-80fps.695809/

This was not using hybrid renderer at all - just raw DrawMeshInstancedIndirect calls and compute buffers. Near the end of the thread I was able to push my implementation to around 400k sprites at 60k, being animated on the cpu by dots (pushing sprite indices for every sprite every frame via compute buffers). The op from that thread then went on to create a version that could apparently handle a million animated sprites: https://forum.unity.com/threads/1-million-animated-sprites-at-60-fps.811116/.

Anyways, I was curious so I tried to make a unity version of bevymark using the most up to date version of all the dots packages. I tried to make it as equivalent to bevymark as I could given my knowledge of dots and bevy. It started to spike below 60fps around 120K:
brave_QUcvGwJCz0

For comparison's sake I can get to around 40K birds in bevymark using cargo run --example bevymark_pipelined --release on the latest pipelined branch.

@cart
Member

cart commented Sep 13, 2021

Thanks for putting that together! I have a couple of questions to help with the comparison:

  1. Are you sure some sprites aren't getting culled? You are randomly placing sprites, so I expect the entire screen to be filled if there are 120k sprites.
  2. What hardware are you running this on?

This was not using hybrid renderer at all - just raw DrawMeshInstancedIndirect calls and compute buffers. Near the end of the thread I was able to push my implementation to around 400k sprites at 60k, being animated on the cpu by dots (pushing sprite indices for every sprite every frame via compute buffers). The op from that thread then went on to create a version that could apparently handle a million animated sprites

You were pretty clear about this, but just to be doubly sure: these were still "statically" positioned sprites (iirc DrawMeshInstancedIndirect allows for reusing things like positions across frames) that were then fed their per-sprite-entity animation indices to select an animation-frame per-sprite-frame. So the amount of data transferred per frame was 1 integer (u8? u32?) per sprite? If so that lines up pretty well with my expectations for performance.

@cart
Member

cart commented Sep 13, 2021

For comparison's sake I can get to around 40K birds in bevymark using cargo run --example bevymark_pipelined --release on the latest pipelined branch.

Can you also try the batched rendering branch: #2642?

@sarkahn
Contributor

sarkahn commented Sep 13, 2021

Thanks for putting that together! I have a couple of questions to help with the comparison:

  1. Are you sure some sprites aren't getting culled? You are randomly placing sprites, so I expect the entire screen to be filled if there are 120k sprites.

I reworked it a bit to ensure they're spread out a bit more - I was spawning a lot in the same spot. Same results:
devenv_LTWMeN1uEX

You can see it's mostly a solid 60fps at around 120k. Any more than that and it starts to spike a lot.

  2. What hardware are you running this on?

Intel core i7 8700 3.2
Nvidia Geforce GTX 1070
16GB RAM

You were pretty clear about this, but just to be doubly sure: these were still "statically" positioned sprites (iirc DrawMeshInstancedIndirect allows for reusing things like positions across frames) that were then fed their per-sprite-entity animation indices to select an animation-frame per-sprite-frame. So the amount of data transferred per frame was 1 integer (u8? u32?) per sprite? If so that lines up pretty well with my expectations for performance.

They were static yeah, from what I remember (it's been a while) I had set up a separate buffer for transforms, uv data, and color data. I think I could have re-used them per frame but I wasn't because I couldn't figure out a nice way to do it, it was easier to just re-create the buffer every frame and re-fill it - filling it was just a memcopy from a native array. Then all the buffers would get pushed to the material every frame: https://github.com/sarkahn/SpriteSheetRenderer/blob/Rewrite/Systems/SpriteRenderSystem.cs

Knowing what I know now I'm sure I could have been a lot smarter about it but I was learning at the time. So yeah, it was transform data (4x4 matrix), colors, and uv data, being pushed to the material every frame.
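To put rough numbers on the difference between uploading full instance data and uploading a single index per sprite, here is a back-of-the-envelope sketch. The byte sizes are assumptions, not measurements from either project: a 4x4 matrix of `f32`, a color as four `f32` values, and UV data as four `f32` values. All names are hypothetical.

```rust
/// Assumed size of a 4x4 matrix of f32 values (16 * 4 bytes).
pub const MATRIX_4X4_F32: u64 = 16 * 4;
/// Assumed size of an RGBA color stored as four f32 values.
pub const RGBA_F32: u64 = 4 * 4;
/// Assumed size of a UV rectangle stored as four f32 values.
pub const UV_RECT_F32: u64 = 4 * 4;
/// Transform + color + UV data per sprite under the assumptions above.
pub const FULL_INSTANCE: u64 = MATRIX_4X4_F32 + RGBA_F32 + UV_RECT_F32;
/// A single u32 animation index per sprite.
pub const INDEX_ONLY: u64 = 4;

/// Bytes copied to the GPU in one frame.
pub fn bytes_per_frame(sprite_count: u64, bytes_per_sprite: u64) -> u64 {
    sprite_count * bytes_per_sprite
}

/// Bytes copied to the GPU per second at a given frame rate.
pub fn bytes_per_second(sprite_count: u64, bytes_per_sprite: u64, fps: u64) -> u64 {
    bytes_per_frame(sprite_count, bytes_per_sprite) * fps
}
```

Under these assumptions, 400,000 sprites with full instance data come to 38,400,000 bytes per frame (about 2.3 GB/s at 60 fps), while one `u32` per sprite is 1,600,000 bytes per frame. Raw copy size is only one part of the cost; filling the buffers on the CPU and synchronizing with the GPU are not modeled here.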

@sarkahn
Contributor

sarkahn commented Sep 14, 2021

From pr/2642 I can get to around 86K before it starts to fall below 60FPS. That's with cargo run --example bevymark_pipelined --release

@benoneal
Author

When I ran the main branch bevymark with and without the release flag, I noticed no difference at all, so I assumed it wouldn't make any difference in the pipelined branch either. With the release flag in the pipelined branch, I hit <60fps here:
image
And at 100k sprites I get this:
image
I couldn't figure out how to take screenshots in-game, but for context, I just ran Cyberpunk 2077 with ultra settings, quality DLSS and full RTX raytracing at 3440x1440 and was hovering around 55-65 avg fps in crowded city streets (modded to have more pedestrians and cars).

@korken89

korken89 commented Sep 24, 2021

Hi, I am evaluating Bevy for use in a visualization tool we are working on, and I also got some weird results.
Running the example spawner in release mode I see ~8 fps on a Ryzen 4800U laptop.
The thing that's weird is that only ~10% of my CPU and ~15% GPU is being used, so it seems like it should run a lot faster?
Is there something one can try to get this better? Our visualization tool expects to load a few hundred OBJ models and then show a few thousand boxes/spheres inside these models.

I'm currently testing Kiss3D to do the same, and getting stable 60 fps with 10 000 boxes test as in spawner, but I like the ECS system of Bevy a lot.
So I'm trying to see if I can get comparable performance.

@parasyte
Contributor

The thing that's weird is that only ~10% of my CPU and ~15% GPU is being used

If you are looking at the percentage indicators in Windows Task Manager, it will only show 100% CPU utilization when all CPU cores are completely busy. So if it is showing around 10% utilization, you probably have something like 8 or 12 cores, and only one of them is doing work.
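The arithmetic behind that estimate can be sketched as follows. This is a simplified model that assumes some cores are fully busy and the rest are completely idle; the function name is hypothetical.

```rust
/// Aggregate CPU utilization (in percent) shown by a whole-machine meter
/// when `busy_cores` out of `total_cores` logical cores are fully loaded
/// and all other cores are idle.
pub fn aggregate_utilization_percent(busy_cores: u32, total_cores: u32) -> f64 {
    assert!(total_cores > 0, "a machine has at least one core");
    assert!(busy_cores <= total_cores, "cannot have more busy cores than cores");
    100.0 * f64::from(busy_cores) / f64::from(total_cores)
}
```

One busy core out of 8 gives 12.5%, and one out of 12 gives about 8.3%, which is why a reading near 10% can still mean a single thread is saturated.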

@korken89

Thanks, but it's more or less 10% per core (seen in htop). I'm in Linux (Arch) if that helps?
I would have expected that one CPU or the GPU would be at 100%. This is what I see in Kiss3D when coming to the limit, so I was expecting something similar in Bevy :)

@parasyte
Contributor

parasyte commented Sep 24, 2021

What I'm getting at is that the spawner example looks like it is CPU bound on a single thread with an iterator over each cube:

for (mut transform, material_handle) in query.iter_mut() {

This is just a guess, I haven't done any profiling. The ECS should be able to parallelize updates by pushing iteration to it with simpler queries. But it will only be able to parallelize queries with shared borrows.

I would expect you to only see a single CPU core at 100% if core affinity was being used. The scheduler can (and should) schedule threads to random cores for each time slice. It keeps single-thread performance high by keeping core temperatures low.

@cart
Member

cart commented Sep 24, 2021

I'd like to point out that any efforts to zero in on why the current renderer is slow are relatively pointless. We already know it is slow and most of the reasons why it was slow. It is getting retired in the next Bevy release. We've already done investigations into this, which have informed the design of the new renderer, which is shaping up quite nicely.

This is just a guess, I haven't done any profiling. The ECS should be able to parallelize updates by pushing iteration to it with simpler queries. But it will only be able to parallelize queries with shared borrows.

We automatically parallelize system execution based on query access. Individual queries within a system can be accessed in a parallel context and you can do parallel iteration over any query in a system.
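As a generic illustration of parallel iteration with mutable access, here is a sketch that uses only the Rust standard library (`std::thread::scope`, available since Rust 1.63). It is not Bevy's query API, whose parallel iteration methods differ between versions; the `Transform` struct and `move_all_parallel` function are hypothetical stand-ins. The point is that mutable data can be updated in parallel as long as each thread gets a disjoint chunk.

```rust
use std::thread;

/// Hypothetical stand-in for a per-entity transform component.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Transform {
    pub x: f32,
    pub y: f32,
}

/// Adds `dx` to every transform's x coordinate, splitting the slice into
/// disjoint chunks so each spawned thread mutates its own part.
pub fn move_all_parallel(transforms: &mut [Transform], dx: f32, workers: usize) {
    let workers = workers.max(1);
    // Ceiling division so every element lands in some chunk; never zero,
    // because `chunks_mut` panics on a chunk size of 0.
    let chunk_len = ((transforms.len() + workers - 1) / workers).max(1);

    // Scoped threads may borrow from the caller's stack; `scope` joins
    // all spawned threads before it returns.
    thread::scope(|scope| {
        for chunk in transforms.chunks_mut(chunk_len) {
            scope.spawn(move || {
                for t in chunk.iter_mut() {
                    t.x += dx;
                }
            });
        }
    });
}
```

Calling `move_all_parallel(&mut transforms, 2.0, 4)` shifts every element by 2.0 using up to four threads. Spawning OS threads per call is only for illustration; an engine would typically reuse a task pool.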

@cart
Member

cart commented Sep 24, 2021

And to be clear, the spawner example uses the old renderer (even on the pipelined-rendering branch).

@korken89

korken89 commented Sep 25, 2021

Thanks for the clarification @cart !
If these issues are being solved, then I think I'll continue with Bevy for our tool and see with the 0.6 release what kind of performance boost we get.

@cart
Member

cart commented Nov 3, 2021

I just opened a PR that adds sprite batching to the new renderer. I'm getting ~130,000 sprites at 60fps on bevymark_pipelined: #3060

@superdump
Contributor

For anyone testing performance on their machines: as cart noted, only the new pipelined-rendering branch is relevant at this point, and only the examples that have been updated to use the new renderer. As far as I am aware they are all named <something>_pipelined.rs, but you can check which renderer an example uses by looking at its source file: DefaultPlugins means the old renderer and PipelinedDefaultPlugins means the new one.

@alice-i-cecile alice-i-cecile added A-Rendering Drawing game state to the screen and removed S-Needs-Investigation This issue requires detective work to figure out what's going wrong labels Nov 25, 2021
@alice-i-cecile alice-i-cecile added this to the Bevy 0.6 milestone Nov 25, 2021
@Weasy666
Contributor

Weasy666 commented Dec 2, 2021

2021-12-02T14:02:15.074942Z  INFO bevymark_pipelined: counter: 124116
2021-12-02T14:02:15.076446Z  INFO bevy diagnostic: frame_time                      :    0.018201s (avg 0.016784s)
2021-12-02T14:02:15.076618Z  INFO bevy diagnostic: fps                             :   61.104787  (avg 60.368606)
2021-12-02T14:02:15.076816Z  INFO bevy diagnostic: frame_count                     : 50127.000000  (avg 50127.000000)

for me on a Ryzen 5 5600X and Radeon RX570 machine. FPS began to drop to 59.594372 when the count reached 128105.
Oh... I ran: cargo run --example bevymark_pipelined --release
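For anyone comparing the frame_time and fps figures quoted in this thread, the conversion is just a reciprocal. The helper below is a hypothetical name for illustration and is not part of Bevy's diagnostics API.

```rust
/// Converts a frame time in seconds to frames per second.
/// Returns `None` for non-positive or non-finite input.
pub fn fps_from_frame_time(seconds: f64) -> Option<f64> {
    if seconds.is_finite() && seconds > 0.0 {
        Some(1.0 / seconds)
    } else {
        None
    }
}
```

An average frame time of 0.016784 s corresponds to roughly 59.58 fps, and the 1.3 seconds per frame reported earlier corresponds to about 0.77 fps. The log's own average fps (60.368606) is not exactly the reciprocal of its average frame time, presumably because the two averages are computed separately.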

@cart
Member

cart commented Dec 2, 2021

I'm closing this out as we have merged the new renderer, which resolves performance issues generally. Obviously there's always more work to do, but we're now competitive with other projects and it is only up from here. Feel free to open more specific issues as we encounter them, such as this Mac-specific issue: #3052

@cart cart closed this as completed Dec 2, 2021