Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Runtime errors caused by race conditions? #1

Closed
elexisvenator opened this issue Dec 30, 2023 · 4 comments
Closed

Runtime errors caused by race conditions? #1

elexisvenator opened this issue Dec 30, 2023 · 4 comments

Comments

@elexisvenator
Copy link

elexisvenator commented Dec 30, 2023

Hi @ChristopherBiscardi

H have just completed your 2048 course and quite enjoyed it. However, I noticed that the game has a few bugs in it that seem to be related to a multithreaded race condition of some sort.

If you run the game and spam arrow keys fast, there is a chance that the game would crash. The exact error varies but is usually either an error trying to do something with an entity that doesn't exist or a numeric overflow. I have also seen other strange behaviour, such as tiles being spawned outside of the game grid, and the game falsely entering an endgame state when moving a full board with valid moves.

This issue happens with the 0.11 version of the code from the course as well as the latest 0.12 code from this repo.

I suspect this is caused by multiple systems acting on the tiles at the same time, with issues being caused by trying to act on an entity when that existed when the system was called but either doesn't exist at that point or has had it's position moved. This sounds like typical concurrency headaches that I was hoping that a system like bevy would largely hide away from us. Is there something in the design of the game that is prone to these sort of issues, or is there a potential lesson here for how to address them?

Thanks, and happy new year,
Ben

racecondition

@ChristopherBiscardi
Copy link
Contributor

I did some light investigation and here's what I'm aware of so far:

I found it easier to replicate these issues on Windows, although the only issue I wasn't able to replicate after more effort across platforms is the Transform/odd tile location issue which I observed on Windows and not Mac (I did not test Linux today, since Mac did replicate and that is close enough for me).

Underflow

The underflow specifically is happening here. I've reproduced it for both directions that require subtraction from the "far side" (Right and Up). This function is only used in one place which is the board_shift.

I need to investigate this a bit more to determine exactly why the underflow is occurring. Over time I've leaned more and more towards using signed integers for all my tile-based grids as this tends to make the math easier to work with. A checked_subtraction could also be a good idea here to surface the error with more information. Most notably, I'd like to see tracing::error! used if this issue occurs with some additional information about why it is occurring.

Tile Transform oddities

I do not know why the tiles sometimes end up overlapping or getting sent off the entire screen. Its possible there's an interpolation bug or other issue in bevy_easings. Reducing the animation time from 100ms to 10ms has a great effect on increasing the frequency of the bundle insertion issue at least.

error[B0003]: Could not insert a bundle (of type \`bevy_easings::EasingComponent<bevy_transform::components::transform::Transform>`) for entity 34v0 because it doesn't exist in this World.

Initial testing seems to indicate that placing the render_tiles system in the PostUpdate schedule completely solves the bundle insert issue.

.add_systems(
    Update,
    (
        render_tile_points,
        board_shift,
        new_tile_handler,
        end_game,
    )
        .run_if(in_state(RunState::Playing)),
)
.add_systems(PostUpdate, render_tiles.run_if(in_state(RunState::Playing)))

continuing that thread, placing the render_tiles system in the next schedule says to me that apply_deferred needs to run before rendering. This makes sense since apply_deferred is where Commands are applied, and that happens between Schedules like this by default. I have confirmed this by using apply_deferred and manually chaining the systems with it in between.

((
    render_tile_points,
    board_shift,
    new_tile_handler,
    end_game,
),
  apply_deferred,
  render_tiles
).chain()

Its notable that the documentation for PostUpdate changed in 0.11 and now includes the following phrasing which makes me believe that the right approach in a "post-schedule v3" world is to put this rendering in PostUpdate since it is responding to the tile updates in Update.

PostUpdate exists to do “engine/plugin response work” to things that happened in Update.

This also looks like it is changing in 0.13 to be automatic based on system parameters (ie: Commands), so we'll have to keep an eye on it for the next iteration of the workshop to see if updates are appropriate.

early game over

This is almost certainly an ordering issue. The logic for has_move requires that the tiles available in the query are the same as the tiles on the board, which isn't necessarily true if there are despawn commands that are pending.

No matter exactly how it is occurring, the end-game behavior is a fantastic example to use for showing how to test bevy apps and systems. It is both tedious to do manually and only occurs in certain conditions after a longer period of play.

Workshop moving forward

The changes I'd like to make moving forward so far are:

  1. Use the tracing infrastructure more deeply in future versions of the workshop
    • This will make it easier for people to debug issues, and is generally a good practice anyway.
  2. Show how to set up testing of bevy systems and a resolution for the end-game bug. I think this is a particularly good candidate because the early testing seems to indicate a logic bug.
  3. Move render_tiles to PostUpdate and explain why in the relevant lesson. This will likely end up being in the same lesson as adding bevy_easings since the bundle issue only shows up at that time.

I have more investigation to do on the underflow issue specifically, as I spent most of my time on the others here.

This sounds like typical concurrency headaches that I was hoping that a system like bevy would largely hide away from us

As far as I can tell, these are logic bugs and the system ordering issues should likely be fixed. If a tile is going to be despawned, it can not then be used in an "end game" calculation before it is despawned (but after the Position component updated) and should not have an "easing bundle" applied to move it around. I can't find a reference to link to, but my current understanding is that a Component update (like Position) would be visible to the very next system that ran whereas a Command is only applied when apply_deferred runs, and that this is effectively the issue for end_game.

However! Within the last 3 months a try_insert was added which could solve the issue for the easings bundle without system re-ordering.

I'm a bit conflicted about using methods like try_insert in educational material specifically (in real-world bevy apps I assume this method could be required due to real-world constraints or preferences). On one hand it can be very useful for "apply only if the entity still exists" and on the other educational material should likely be built with correct system ordering, etc. I haven't made a decision yet for which direction I'm going to go with the workshop, although I'm currently leaning towards fixing the system order.

Happy New Year and thanks for the issue. I'll use this to improve the workshop. I'm planning to split this out into a couple issues after doing some more investigation.

@ChristopherBiscardi
Copy link
Contributor

While working on this, I had a realization about the game itself. 2048 is a game where the board tiles move then the new tiles are added. It sounds obvious typing this out but this ordering is inherent to the way the game works, so requiring the systems handling board-shifting, new tile handling, and even end-game detection to be done in-order is "how the game works" and not actually an optional feature of the implementation. This makes the original implementation in the workshop logically invalid, even though it mostly works.

Bevy has a really straightforward solution to this these days in a post Schedule v3 world: chain.

.add_systems(
    Update,
    (
        board_shift,
        new_tile_handler,
        render_tile_points,
        render_tiles,
        end_game
    )
        .chain()
        .run_if(in_state(RunState::Playing)),

I've confirmed that this in fact fixes all of the above issues, including a test that panics depending on system ordering fairly consistently (but not always) without chain, and never panics with .chain().

@ChristopherBiscardi
Copy link
Contributor

The aforementioned test looks like:

#[cfg(test)]
mod tests {
    use bevy::ecs::system::CommandQueue;

    use super::*;

    #[test]
    fn insert_bundle_on_removed_tile() {
        let mut app = App::new();
        let board = Board::new(4);
        app.insert_resource(Board::new(4));
        app.add_event::<NewTileEvent>();
        app.init_resource::<Game>();
        app.add_plugins((
            EasingsPlugin,
            bevy::time::TimePlugin,
        ));
        app.add_systems(
            Startup,
            (setup, spawn_board).chain(),
        );
        app.add_systems(
            Update,
            (board_shift, render_tiles).chain(),
        );

        // insert tiles to set up a game

        let mut command_queue = CommandQueue::default();

        let mut commands =
            Commands::new(&mut command_queue, &app.world);
        spawn_tile(
            &mut commands,
            &board,
            Position { x: 0, y: 0 },
        );
        spawn_tile(
            &mut commands,
            &board,
            Position { x: 0, y: 0 },
        );
        command_queue.apply(&mut app.world);

        // move board up
        let mut input = ButtonInput::<KeyCode>::default();
        input.press(KeyCode::ArrowUp);
        app.insert_resource(input);

        // Run systems
        app.update();

        // move board left
        let mut input = ButtonInput::<KeyCode>::default();
        input.press(KeyCode::ArrowLeft);
        app.insert_resource(input);

        // Run systems
        app.update();

        // Clear the `just_pressed` status for all `KeyCode`s
        app.world
            .resource_mut::<ButtonInput<KeyCode>>()
            .clear();

        app.update();
    }
}

@ChristopherBiscardi
Copy link
Contributor

I added a game over test and the chain functionality in 1687150 for the bevy 0.13 workshop edition, so I'm going to close this as it is part of the next version of the workshop now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants