
Fix races around pthread exit and join #409

Merged (2 commits), Jun 7, 2023
Conversation

@yamt (Contributor) commented Apr 14, 2023

No description provided.

@yamt yamt marked this pull request as ready for review April 14, 2023 13:40
@alexcrichton (Collaborator) left a comment

There are other locations such as here which alias __tl_{lock,unlock} to a dummy function (and a few others as well). Should those be removed?

Another location I see is this one, which, by avoiding __tl_sync, is not properly synchronizing with __pthread_exit: as soon as this line is executed the thread is eligible for freeing, which would be a use-after-free with later self->foo writes (or stack writes).

Additionally I'll still point out that despite the example in #405 not using detached threads I don't think that free is sufficient to deallocate a thread's stack. There's no guarantee that the stack isn't in use at the end of free or at the end of __pthread_exit when everything is cleaned up, meaning it could still be a use-after-free.

It seems like wasi-libc was originally patched to not use threads, but when re-adding threads support it seems like it might be worthwhile to start from scratch or otherwise comprehensively review all threading-related patches, although I'm not sure if that's possible.

Comment on lines 31 to 46
# Unlock thread list. (as CLONE_CHILD_CLEARTID would do for Linux)
#
# Note: once we unlock the thread list, our "map_base" can be freed
# by a joining thread. It's safe as we are in ASM and no longer use
# our C stack or pthread_t.
i32.const __thread_list_lock
i32.const 0
i32.atomic.store 0
# As an optimization, we can check tl_lock_waiters here.
# But for now, simply wake up unconditionally as
# CLONE_CHILD_CLEARTID does.
i32.const __thread_list_lock
i32.const 1
memory.atomic.notify 0
drop

Collaborator

Reading over the definition of __tl_unlock, it additionally decrements tl_lock_count. It might be good to assert in C that it is zero before returning here to perform this. Additionally it might be good to leave a comment in __tl_unlock that this needs updating if the implementation changes.

Member

Why not just call __tl_unlock?

Contributor Author

Reading over the definition of __tl_unlock, it additionally decrements tl_lock_count. It might be good to assert in C that it is zero before returning here to perform this. Additionally it might be good to leave a comment in __tl_unlock that this needs updating if the implementation changes.

i feel it belongs to upstream musl, not here.

Contributor Author

Why not just call __tl_unlock?

because __tl_unlock is a C function, which potentially requires C stack.

Collaborator

I'm not sure what you mean by it belongs in upstream musl? Wasi-libc is already patching musl a fair bit, so what I'm saying is that as part of the patching the cases where __pthread_exit returns should all assert that the tl_lock_count is zero.

Contributor Author

I'm not sure what you mean by it belongs in upstream musl? Wasi-libc is already patching musl a fair bit, so what I'm saying is that as part of the patching the cases where __pthread_exit returns should all assert that the tl_lock_count is zero.

i meant that "tl_lock_count should be zero when a thread exits" is not wasi-specific.
if you feel strongly i can add such an assertion.

@yamt (Contributor Author) commented Apr 14, 2023

There are other locations such as here which alias __tl_{lock,unlock} to a dummy function (and a few others as well). Should those be removed?

Another location I see is this one, which, by avoiding __tl_sync, is not properly synchronizing with __pthread_exit: as soon as this line is executed the thread is eligible for freeing, which would be a use-after-free with later self->foo writes (or stack writes).

my understanding is that these weak symbols are resolved to the real __tl_lock/unlock
when pthread_create.o is linked.

Additionally I'll still point out that despite the example in #405 not using detached threads I don't think that free is sufficient to deallocate a thread's stack. There's no guarantee that the stack isn't in use at the end of free or at the end of __pthread_exit when everything is cleaned up, meaning it could still be a use-after-free.

i agree it's broken. i have a few ideas to fix it. i intentionally didn't include a fix in this PR because it's a separate issue.

* See also set_tid_address(2)
*
* In WebAssembly, we leave it to wasi_thread_start instead.
*/
Member

Could we instead call __tl_unlock right before a_store(&self->detach_state, DT_EXITED); like we do right before free(self->map_base);?

In the first case we are unlocking right before notifying the joiner that they can call free, and in the latter case we are unlocking right before we call free ourselves, so it seems symmetrical.

(I'm suggesting this so that we can avoid re-implementing __tl_unlock in asm if we can).

Member

Actually, what was wrong with the single __tl_unlock on line 172 that we had previously?

Contributor Author

Could we instead call __tl_unlock right before a_store(&self->detach_state, DT_EXITED); like we do right before free(self->map_base);?

In the first case we are unlocking right before notifying the joiner that they can call free, and in the latter case we are unlocking right before we call free ourselves, so it seems symmetrical.

  • the detached case is broken, as alex said.
  • detached threads never have joiners.

(I'm suggesting this so that we can avoid re-implementing __tl_unlock in asm if we can).

this PR doesn't re-implement __tl_unlock.
it emulates CLONE_CHILD_CLEARTID, which the __tl_lock/unlock/sync protocol relies on.

Contributor Author

Actually, what was wrong with the single __tl_unlock on line 172 that we had previously?

  • double unlock.
  • it unlocks too early, allowing the joiner to free our stack before we are finished with it.

@sbc100 (Member) Apr 15, 2023

I agree the double unlock thing looks like a bug.

But the joiner is not waiting on the tl_lock is it? The joiner seems to be waiting on t->detach_state. In fact, I don't see the tl_lock referenced at all in pthread_join.c. Maybe I'm missing something?

Contributor Author

the joiner uses __tl_sync to sync with the exit.

@sbc100 (Member) commented Apr 15, 2023

I wonder if we should instead come up with a different solution that doesn't involve a thread free'ing its own stack. How about this:

  1. Reserve the first word of each stack to mean "ready for collection"
  2. When exiting a detached thread, the current stack is added to a linked list of "orphaned stacks" (this can happen in C code)
  3. At the end of __wasi_thread_start_C, atomically mark the current stack as ready for collection.
  4. Whenever a new thread is started (or whenever __tl_lock is called?) the calling thread walks the list of "orphaned stacks" and calls free on any stack marked as "ready for collection".
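The steps above could be sketched roughly as follows (a hypothetical sketch: the struct and function names are made up, and in real code the list itself would be manipulated under __tl_lock):

```c
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical "orphaned stacks" list, per the proposal above. */
struct orphan {
    struct orphan *next;
    _Atomic int ready;   /* step 3: set once the stack is no longer in use */
    void *map_base;      /* the exited thread's stack allocation */
};

static struct orphan *orphans;

/* Step 2: an exiting detached thread links its stack in (still in C). */
static void orphan_push(struct orphan *o)
{
    o->next = orphans;
    orphans = o;
}

/* Step 4: a later thread frees every entry marked ready for collection. */
static void orphan_reap(void)
{
    struct orphan **pp = &orphans;
    while (*pp) {
        struct orphan *o = *pp;
        if (atomic_load(&o->ready)) {
            *pp = o->next;
            free(o->map_base);
            free(o);
        } else {
            pp = &o->next;
        }
    }
}
```

Running the reap under the same lock that protects the list means the only thing set outside the lock is the per-entry ready flag, which is what the atomic load guards against.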

@alexcrichton (Collaborator)

my understanding is that these weak symbols are resolved to the real __tl_lock/unlock
when pthread_create.o is linked.

Aha you are correct, I was misunderstanding what a weak alias did. Disregard these comments of mine then!

I wonder if we should instead come up with a different solution that doesn't involve a thread free'ing its own stack. How about this:

Is the "ready for collection" flag necessary in that scheme? Given that the linked list is protected by __tl_lock() it shouldn't be possible to witness an entry of the list which isn't collectable I think? (assuming detached threads are updated to skip the explicit __tl_unlock and fall through to the asm that is added in this PR)

It seems a bit of a bummer though in that if you have a bunch of threads all exit at once and you never create another thread then that's a bunch of memory stuck which can't be reused. Another possible alternative though would be to open-code the mutex around free a bit where exiting a detached thread would look like:

  • In C, acquire the free lock
  • In C, return to asm
  • In asm, set the stack pointer to "close to the end of the stack"
  • In asm, call the internal free function which doesn't acquire a lock
  • In asm, release the free lock
  • In asm, release the __tl_lock() lock

The weird part though is that free is running on a stack that it's actively freeing so it relies a bit on the updates to that memory to only modify possibly the header or footer of the block of memory, nothing in the middle, otherwise the stack of free itself could be corrupted. Alternatively though there could be a small static which free is executed on as a stack pointer.

@yamt (Contributor Author) commented Apr 17, 2023

I wonder if we should instead come up with a different solution that doesn't involve a thread free'ing its own stack.

for detached case, i'm thinking:

a. implement __unmapself equivalent for wasi as well
b. do something like the following in the relevant part of __pthread_exit. (and probably free it in pthread_create as well.)

free(lazy); // lazy is a global var
lazy = self->map_base;
self->map_base = NULL;

right now i'm inclined toward b. because free can require more complex context than munmap.
anyway i plan to do it in a separate PR.
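Option b could look roughly like this in C (a sketch only; the `lazy` variable and its locking are assumptions, not actual wasi-libc code):

```c
#include <stdlib.h>

/* Deferred stack reclamation: a thread never frees the stack it is
 * currently running on.  Instead it frees the stack parked by the
 * previous exiter and parks its own for a later thread to reclaim.
 * In real code, "lazy" would need to be protected by a lock (e.g. the
 * thread list lock) and also drained from pthread_create. */
static void *lazy;  /* stack of the most recently exited detached thread */

static void lazy_free_stack(void **map_base)
{
    free(lazy);         /* reclaim the previous exiter's stack */
    lazy = *map_base;   /* park ours; freed by a later thread */
    *map_base = NULL;
}
```

The invariant is that at most one stack is parked at a time, and the parked stack is never the caller's own, so free never touches memory the current thread is executing on.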

@sbc100 (Member) commented Apr 17, 2023

I wonder if we should instead come up with a different solution that doesn't involve a thread free'ing its own stack.

for detached case, i'm thinking:

a. implement __unmapself equivalent for wasi as well
b. do something like the following in the relevant part of __pthread_exit. (and probably free it in pthread_create as well.)

free(lazy); // lazy is a global var
lazy = self->map_base;
self->map_base = NULL;

right now i'm inclined toward b. because free can require more complex context than munmap. anyway i plan to do it in a separate PR.

Neat idea. Sounds like that could work.

@yamt (Contributor Author) commented Apr 28, 2023

does anyone still have any concerns?

@yamt (Contributor Author) commented May 30, 2023

ping

i32.const __thread_list_lock
i32.const 1
memory.atomic.notify 0
drop
Member

How about putting all this new code in a new local function called __tl_unlock_asm? And perhaps mention in the comment why we need the asm re-implementation here.

Otherwise this lgtm now.

Contributor Author

i added a comment.

@abrown (Collaborator) left a comment

I spent some time today staring at this trying to convince myself that this is correct. I'm not too familiar with the details of "CLONE_CHILD_CLEARTID" and the "__tl_lock/unlock/sync protocol" referred to, so it could be that I'm too concerned about something that is actually fine here. What is concerning, though, is that it is not easy to figure out what the change is doing, and I suspect that will make troubleshooting difficult in the future.

Tell me if this is a correct reading: we need to control concurrent access to __thread_list_lock, which previously was done via __tl_lock, __tl_unlock and __tl_sync. Because the current thread's stack might be deallocated, we can't call __tl_unlock directly and instead do something "equivalent" in assembly code, by just atomically setting __thread_list_lock to zero and waking the next waiter. This means that there are two places munging with __thread_list_lock: the assembly code and __tl_unlock. (Why isn't the function removed then, e.g., to avoid future errors? Presumably it is used somewhere else when the stack is valid?). But the "equivalent" assembly code is also not completely equivalent: it doesn't decrement tl_lock_count as __tl_unlock does. So it appears as if there are cases where the assembly code could result in different behavior than the original __tl_unlock--i.e., when __tl_lock has been called multiple times previously. It is unclear why this different behavior is OK here.

None of this is explained in the commit comments or commit message. It feels like some kind of bug is already present (re: tl_lock_count)--if it is not, it still "appears" that way because there is no explanation as to why we can disregard tl_lock_count. And, since there is little explanation, it seems likely that if there isn't a bug already present, there could be one soon if someone again inserts a __tl_unlock in the wrong place (since they should rely on the assembly code instead).

@yamt (Contributor Author) commented Jun 1, 2023

I spent some time today staring at this trying to convince myself that this is correct. I'm not too familiar with the details of "CLONE_CHILD_CLEARTID" and the "__tl_lock/unlock/sync protocol" referred to, so it could be that I'm too concerned about something that is actually fine here. What is concerning, though, is that it is not easy to figure out what the change is doing, and I suspect that will make troubleshooting difficult in the future.

Tell me if this is a correct reading: we need to control concurrent access to __thread_list_lock, which previously was done via __tl_lock, __tl_unlock and __tl_sync. Because the current thread's stack might be deallocated, we can't call __tl_unlock directly and instead do something "equivalent" in assembly code, by just atomically setting __thread_list_lock to zero and waking the next waiter. This means that there are two places munging with __thread_list_lock: the assembly code and __tl_unlock. (Why isn't the function removed then, e.g., to avoid future errors? Presumably it is used somewhere else when the stack is valid?). But the "equivalent" assembly code is also not completely equivalent: it doesn't decrement tl_lock_count as __tl_unlock does. So it appears as if there are cases where the assembly code could result in different behavior than the original __tl_unlock--i.e., when __tl_lock has been called multiple times previously. It is unclear why this different behavior is OK here.

the asm code mimics CLONE_CHILD_CLEARTID, not __tl_unlock.

CLONE_CHILD_CLEARTID is a linux kernel functionality the original musl code relies on.
when the code was adapted to wasi, i guess the author thought calling __tl_unlock was a
good enough equivalent. however, it actually wasn't. (thus it introduced a race.)
this PR restores the original musl behavior by providing a closer equivalent of
CLONE_CHILD_CLEARTID.

None of this is explained in the commit comments or commit message. It feels like some kind of bug is already present (re: tl_lock_count)--if it is not, it still "appears" that way because there is no explanation as to why we can disregard tl_lock_count. And, since there is little explanation, it seems likely that if there isn't a bug already present, there could be one soon if someone again inserts a __tl_unlock in the wrong place (since they should rely on the assembly code instead).

i feel it's actually clearly explained in the comment.

@sbc100 (Member) commented Jun 1, 2023

the asm code mimics CLONE_CHILD_CLEARTID, not __tl_unlock.

I think this is the part that took me a while to understand. At this point I think the comment explains it, but early on I was confused about this point (and CLONE_CHILD_CLEARTID in general).

@yamt (Contributor Author) commented Jun 1, 2023

I'm not too familiar with the details of "CLONE_CHILD_CLEARTID"

https://www.man7.org/linux/man-pages/man2/clone.2.html seems to have a good enough explanation to read this patch:

CLONE_CHILD_CLEARTID (since Linux 2.5.49)
              Clear (zero) the child thread ID at the location pointed
              to by child_tid (clone()) or cl_args.child_tid (clone3())
              in child memory when the child exits, and do a wakeup on
              the futex at that address.
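In C terms, the kernel-side behavior that the asm in this PR emulates amounts to roughly the following (a sketch only: the variable here is a local stand-in for __thread_list_lock, and the wake step is shown as a comment because plain C has no futex equivalent):

```c
#include <stdatomic.h>

/* Stand-in for __thread_list_lock; it doubles as the child-tid word. */
static _Atomic int thread_list_lock = 42;  /* nonzero while the child runs */

/* What CLONE_CHILD_CLEARTID does at child exit: zero the tid word,
 * then wake a waiter on the futex at that address. */
static void child_clear_tid(void)
{
    atomic_store(&thread_list_lock, 0);
    /* futex(&thread_list_lock, FUTEX_WAKE, 1) on Linux; the
     * i32.atomic.store + memory.atomic.notify pair in this PR's asm. */
}
```

The reason this lives in asm in __pthread_exit is that, the instant the store lands, a joiner may free the exiting thread's stack, so no further C code can safely run on it.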

@sbc100 sbc100 merged commit aecd368 into WebAssembly:main Jun 7, 2023
yamt added a commit to yamt/wasi-sdk that referenced this pull request Aug 14, 2023
I want to include WebAssembly/wasi-libc#409
which is rather easy to hit when people are experimenting with wasi-threads.
yamt added a commit to yamt/wasi-sdk that referenced this pull request Aug 15, 2023
I want to include WebAssembly/wasi-libc#409
which is rather easy to hit when people are experimenting with wasi-threads.
abrown pushed a commit to WebAssembly/wasi-sdk that referenced this pull request Aug 22, 2023
I want to include WebAssembly/wasi-libc#409
which is rather easy to hit when people are experimenting with wasi-threads.