Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CI hang on ubuntu-latest #113

Closed
brettle opened this issue Jul 14, 2024 · 11 comments
Closed

CI hang on ubuntu-latest #113

brettle opened this issue Jul 14, 2024 · 11 comments

Comments

@brettle
Copy link
Member

brettle commented Jul 14, 2024

Opening this issue just in case we see something similar again...

Here's the CI log. It's particularly strange that the test appears to have just permanently frozen with no output in the middle of a test despite the 30 minute timeout on the test. It looks like it was eventually cancelled by github after the CI had been running for 6 hours.

Rerunning the job succeeded in just a few minutes.

@brettle
Copy link
Member Author

brettle commented Jul 15, 2024

It happened again. Here's the CI log.

@brettle brettle changed the title One-off CI hang on ubuntu-latest CI hang on ubuntu-latest Jul 15, 2024
@brettle
Copy link
Member Author

brettle commented Jul 16, 2024

I am seeing this occasionally locally. If I attach the C/C++ debugger to the Java process, I see this as the stack trace for one of the threads:

libc.so.6!futex_wait(unsigned int * futex_word, unsigned int expected, int private) (/usr/src/debug/glibc-2.39-17.fc40.x86_64/sysdeps/nptl/futex-internal.h:146)
libc.so.6!__GI___lll_lock_wait(int * futex, int * futex@entry, int private) (/usr/src/debug/glibc-2.39-17.fc40.x86_64/nptl/lowlevellock.c:49)
libc.so.6!lll_mutex_lock_optimized(pthread_mutex_t * mutex) (/usr/src/debug/glibc-2.39-17.fc40.x86_64/nptl/pthread_mutex_lock.c:48)
libc.so.6!___pthread_mutex_lock(pthread_mutex_t * mutex) (/usr/src/debug/glibc-2.39-17.fc40.x86_64/nptl/pthread_mutex_lock.c:93)
libhalsim_ws_server.so!__gthread_mutex_lock(__gthread_mutex_t * __mutex) (/usr/include/x86_64-linux-gnu/c++/11/bits/gthr-default.h:749)
libhalsim_ws_server.so!std::mutex::lock(class std::mutex * const this) (/usr/include/c++/11/bits/std_mutex.h:100)
libhalsim_ws_server.so!std::scoped_lock<std::mutex>::scoped_lock(std::scoped_lock<std::mutex>::mutex_type & __m, class std::scoped_lock<std::mutex> * const this) (/usr/include/c++/11/mutex:655)
libhalsim_ws_server.so!wpi::uv::AsyncFunction<void(std::function<void()>)>::Call<wpilibws::HALSimWSProviderSimDevices::DeviceCreatedCallback(char const*, HAL_SimDeviceHandle)::<lambda()> >(class wpi::uv::AsyncFunction<void(std::function<void()>)> * const this) (/work/wpinet/src/main/native/include/wpinet/uv/AsyncFunction.h:145)
libhalsim_ws_server.so!wpilibws::HALSimWSProviderSimDevices::DeviceCreatedCallback(class wpilibws::HALSimWSProviderSimDevices * const this, const char * name, HAL_SimDeviceHandle handle) (/work/simulation/halsim_ws_core/src/main/native/cpp/WSProvider_SimDevice.cpp:277)
libwpiHal.so!hal::impl::SimPrefixCallbackRegistry<void (*)(char const*, void*, int)>::Invoke<int&>(const char * name, const class hal::impl::SimPrefixCallbackRegistry<void (*)(char const*, void*, int)> * const this) (/work/hal/src/main/native/sim/mockdata/SimDeviceDataInternal.h:123)
libwpiHal.so!hal::impl::SimPrefixCallbackRegistry<void (*)(char const*, void*, int)>::operator()<char const*&, int&>(const class hal::impl::SimPrefixCallbackRegistry<void (*)(char const*, void*, int)> * const this) (/work/hal/src/main/native/sim/mockdata/SimDeviceDataInternal.h:131)
libwpiHal.so!hal::SimDeviceData::CreateDevice(class hal::SimDeviceData * const this, const char * name) (/work/hal/src/main/native/sim/mockdata/SimDeviceData.cpp:114)
libwpiHaljni.so!Java_edu_wpi_first_hal_SimDeviceJNI_createSimDevice(JNIEnv * env, jstring name) (/work/wpiutil/src/main/native/thirdparty/llvm/include/wpi/SmallVector.h:273)
[Unknown/Just-In-Time compiled code] (Unknown Source:0)

Basically, it's blocking waiting for a mutex while running HALSimWS's createDevice callback. The device being created in this case is "CANCoder:CANSparkMax[2]".

Not sure yet what other thread is holding the mutex...

@brettle
Copy link
Member Author

brettle commented Jul 16, 2024

This is the stack for the thread that is holding the mutex:

libwpiHal.so!wpi::recursive_spinlock1::try_lock(wpi::recursive_spinlock1 * const this) (/work/wpiutil/src/main/native/include/wpi/spinlock.h:56)
libwpiHal.so!wpi::recursive_spinlock1::lock(wpi::recursive_spinlock1 * const this) (/work/wpiutil/src/main/native/include/wpi/spinlock.h:71)
libwpiHal.so!std::scoped_lock<wpi::recursive_spinlock1>::scoped_lock(std::scoped_lock<wpi::recursive_spinlock1>::mutex_type & __m, std::scoped_lock<wpi::recursive_spinlock1> * const this) (/usr/include/c++/11/mutex:655)
libwpiHal.so!hal::SimDeviceData::RegisterValueCreatedCallback(hal::SimDeviceData * const this, HAL_SimDeviceHandle device, void * param, HALSIM_SimValueCallback callback, bool initialNotify) (/work/hal/src/main/native/sim/mockdata/SimDeviceData.cpp:346)
libhalsim_ws_server.so!wpilibws::HALSimWSProviderSimDevice::OnNetworkConnected(wpilibws::HALSimWSProviderSimDevice * const this, std::shared_ptr<wpilibws::HALSimBaseWebSocketConnection> ws) (/work/simulation/halsim_ws_core/src/main/native/cpp/WSProvider_SimDevice.cpp:37)
libhalsim_ws_server.so!operator()(const struct {...} * const __closure) (/work/simulation/halsim_ws_core/src/main/native/cpp/WSProvider_SimDevice.cpp:277)
libhalsim_ws_server.so!std::__invoke_impl<void, wpilibws::HALSimWSProviderSimDevices::DeviceCreatedCallback(char const*, HAL_SimDeviceHandle)::<lambda()>&>(struct {...} & __f) (/usr/include/c++/11/bits/invoke.h:61)
libhalsim_ws_server.so!std::__invoke_r<void, wpilibws::HALSimWSProviderSimDevices::DeviceCreatedCallback(char const*, HAL_SimDeviceHandle)::<lambda()>&>(struct {...} & __fn) (/usr/include/c++/11/bits/invoke.h:111)
libhalsim_ws_server.so!std::_Function_handler<void(), wpilibws::HALSimWSProviderSimDevices::DeviceCreatedCallback(char const*, HAL_SimDeviceHandle)::<lambda()> >::_M_invoke(const std::_Any_data &)(const std::_Any_data & __functor) (/usr/include/c++/11/bits/std_function.h:290)
libhalsim_ws_server.so!std::function<void ()>::operator()() const(const std::function<void()> * const this) (/usr/include/c++/11/bits/std_function.h:590)
libhalsim_ws_server.so!operator()<wpi::promise<void> >(wpilibws::HALSimWSProviderSimDevices::LoopFn func, wpi::promise<void> out) (/work/simulation/halsim_ws_core/src/main/native/cpp/WSProvider_SimDevice.cpp:293)
libhalsim_ws_server.so!std::__invoke_impl<void, wpilibws::HALSimWSProviderSimDevices::Initialize(wpi::uv::Loop&)::<lambda(auto:31, wpilibws::HALSimWSProviderSimDevices::LoopFn)>&, wpi::promise<void>, std::function<void()> >(struct {...} & __f) (/usr/include/c++/11/bits/invoke.h:61)
libhalsim_ws_server.so!std::__invoke_r<void, wpilibws::HALSimWSProviderSimDevices::Initialize(wpi::uv::Loop&)::<lambda(auto:31, wpilibws::HALSimWSProviderSimDevices::LoopFn)>&, wpi::promise<void>, std::function<void()> >(struct {...} & __fn) (/usr/include/c++/11/bits/invoke.h:111)
libhalsim_ws_server.so!std::_Function_handler<void(wpi::promise<void>, std::function<void()>), wpilibws::HALSimWSProviderSimDevices::Initialize(wpi::uv::Loop&)::<lambda(auto:31, wpilibws::HALSimWSProviderSimDevices::LoopFn)> >::_M_invoke(const std::_Any_data &, wpi::promise<void> &&, std::function<void()> &&)(const std::_Any_data & __functor, wpi::promise<void> && __args#0, std::function<void()> && __args#1) (/usr/include/c++/11/bits/std_function.h:290)
libhalsim_ws_server.so!std::function<void (wpi::promise<void>, std::function<void ()>)>::operator()(wpi::promise<void>, std::function<void ()>) const(std::function<void()> __args#1, wpi::promise<void> __args#0, const std::function<void(wpi::promise<void>, std::function<void()>)> * const this) (/usr/include/c++/11/bits/std_function.h:590)
libhalsim_ws_server.so!std::__invoke_impl<void, std::function<void (wpi::promise<void>, std::function<void ()>)>&, wpi::promise<void>, std::function<void ()> >(std::__invoke_other, std::function<void (wpi::promise<void>, std::function<void ()>)>&, wpi::promise<void>&&, std::function<void ()>&&)(std::function<void(wpi::promise<void>, std::function<void()>)> & __f) (/usr/include/c++/11/bits/invoke.h:61)
libhalsim_ws_server.so!std::__invoke<std::function<void (wpi::promise<void>, std::function<void ()>)>&, wpi::promise<void>, std::function<void ()> >(std::function<void (wpi::promise<void>, std::function<void ()>)>&, wpi::promise<void>&&, std::function<void ()>&&)(std::function<void(wpi::promise<void>, std::function<void()>)> & __fn) (/usr/include/c++/11/bits/invoke.h:96)
libhalsim_ws_server.so!std::__apply_impl<std::function<void (wpi::promise<void>, std::function<void ()>)>&, std::tuple<wpi::promise<void>, std::function<void ()> >, 0ul, 1ul>(std::function<void (wpi::promise<void>, std::function<void ()>)>&, std::tuple<wpi::promise<void>, std::function<void ()> >&&, std::integer_sequence<unsigned long, 0ul, 1ul>)(std::tuple<wpi::promise<void>, std::function<void()> > && __t, std::function<void(wpi::promise<void>, std::function<void()>)> & __f) (/usr/include/c++/11/tuple:1854)
libhalsim_ws_server.so!std::apply<std::function<void (wpi::promise<void>, std::function<void ()>)>&, std::tuple<wpi::promise<void>, std::function<void ()> > >(std::function<void (wpi::promise<void>, std::function<void ()>)>&, std::tuple<wpi::promise<void>, std::function<void ()> >&&)(std::tuple<wpi::promise<void>, std::function<void()> > && __t, std::function<void(wpi::promise<void>, std::function<void()>)> & __f) (/usr/include/c++/11/tuple:1865)
libhalsim_ws_server.so!wpi::uv::AsyncFunction<void (std::function<void ()>)>::Create(std::shared_ptr<wpi::uv::Loop> const&, std::function<void (wpi::promise<void>, std::function<void ()>)>)::{lambda(uv_async_s*)#1}::operator()(uv_async_s*) const(uv_async_t * handle) (/work/wpinet/src/main/native/include/wpinet/uv/AsyncFunction.h:95)
libhalsim_ws_server.so!wpi::uv::AsyncFunction<void (std::function<void ()>)>::Create(std::shared_ptr<wpi::uv::Loop> const&, std::function<void (wpi::promise<void>, std::function<void ()>)>)::{lambda(uv_async_s*)#1}::_FUN(uv_async_s*)() (/work/wpinet/src/main/native/include/wpinet/uv/AsyncFunction.h:84)
libwpinet.so!uv__async_io(uv_loop_t * loop, uv__io_t * w, unsigned int events) (/work/wpinet/src/main/native/thirdparty/libuv/src/unix/async.cpp:177)
libwpinet.so!uv__io_poll(uv_loop_t * loop, uv_loop_t * loop@entry, int timeout) (/work/wpinet/src/main/native/thirdparty/libuv/src/unix/linux.cpp:1527)
libwpinet.so!uv_run(uv_loop_t * loop, uv_run_mode mode, uv_run_mode mode@entry) (/work/wpinet/src/main/native/thirdparty/libuv/src/unix/core.cpp:448)
libwpinet.so!wpi::uv::Loop::Run(wpi::uv::Loop::Mode mode, wpi::uv::Loop * const this) (/work/wpinet/src/main/native/include/wpinet/uv/Loop.h:113)
libwpinet.so!wpi::EventLoopRunner::Thread::Main(wpi::EventLoopRunner::Thread * const this) (/work/wpinet/src/main/native/cpp/EventLoopRunner.cpp:36)
libwpiutil.so!operator()(const struct {...} * const __closure) (/work/wpiutil/src/main/native/cpp/SafeThread.cpp:79)
libwpiutil.so!std::__invoke_impl<void, wpi::detail::SafeThreadOwnerBase::Start(std::shared_ptr<wpi::SafeThreadBase>)::<lambda()> >(struct {...} && __f) (/usr/include/c++/11/bits/invoke.h:61)
libwpiutil.so!std::__invoke<wpi::detail::SafeThreadOwnerBase::Start(std::shared_ptr<wpi::SafeThreadBase>)::<lambda()> >(struct {...} && __fn) (/usr/include/c++/11/bits/invoke.h:96)
libwpiutil.so!std::thread::_Invoker<std::tuple<wpi::detail::SafeThreadOwnerBase::Start(std::shared_ptr<wpi::SafeThreadBase>)::<lambda()> > >::_M_invoke<0>(std::thread::_Invoker<std::tuple<wpi::detail::SafeThreadOwnerBase::Start(std::shared_ptr<wpi::SafeThreadBase>)::<lambda()> > > * const this) (/usr/include/c++/11/bits/std_thread.h:259)
libwpiutil.so!std::thread::_Invoker<std::tuple<wpi::detail::SafeThreadOwnerBase::Start(std::shared_ptr<wpi::SafeThreadBase>)::<lambda()> > >::operator()(std::thread::_Invoker<std::tuple<wpi::detail::SafeThreadOwnerBase::Start(std::shared_ptr<wpi::SafeThreadBase>)::<lambda()> > > * const this) (/usr/include/c++/11/bits/std_thread.h:266)
libwpiutil.so!std::thread::_State_impl<std::thread::_Invoker<std::tuple<wpi::detail::SafeThreadOwnerBase::Start(std::shared_ptr<wpi::SafeThreadBase>)::<lambda()> > > >::_M_run(void)(std::thread::_State_impl<std::thread::_Invoker<std::tuple<wpi::detail::SafeThreadOwnerBase::Start(std::shared_ptr<wpi::SafeThreadBase>)::<lambda()> > > > * const this) (/usr/include/c++/11/bits/std_thread.h:211)
libstdc++.so.6!execute_native_thread_routine (Unknown Source:0)
libc.so.6!start_thread(void * arg) (/usr/src/debug/glibc-2.39-17.fc40.x86_64/nptl/pthread_create.c:447)
libc.so.6!clone3() (/usr/src/debug/glibc-2.39-17.fc40.x86_64/sysdeps/unix/sysv/linux/x86_64/clone3.S:78)

That thread is blocking waiting for the SimDeviceData mutex while HALSimWS registers a valueCreated callback in response to getting a network connection. Of course the SimDeviceData mutex is held by the first thread because it was acquired in SimDeviceData::CreateDevice(). Thus the deadlock.

Still pondering how to avoid this...

@brettle
Copy link
Member Author

brettle commented Jul 16, 2024

@CoolSpy3, how's this for a possible workaround:

  1. In WebotsSimulator, before it creates the Robot, it tells WebotsSupervisor (via NT) "Disconnect from HALSimWS".
  2. When WebotsSupervisor receives that, it calls wsConnection.object.closeBlocking() and then tells WebotsSimulator (via NT) "Disconnected".
  3. When WebotsSimulator receives that, it creates the Robot (and thus its SimDevices) and calls startCompetition(). When waitForUserToStart() gets called, it tells WebotsSupervisor to load the world.
  4. When WebotsSupervisor restarts after loading the world, it waits until the HALSimWS connection is made and no messages have been received over that connection for some period of time (1 sec?) before telling WebotsSimulator "Completed".
  5. waitForUserToStart() completes and startTimeSync() runs as it does now.

@brettle
Copy link
Member Author

brettle commented Jul 16, 2024

Opened issue wpilibsuite/allwpilib/issues/6842 to cover the underlying problem.

@brettle
Copy link
Member Author

brettle commented Jul 16, 2024

It looks like I've successfully worked around this in PR #118. The workaround is simpler than described above. Instead, WebotsSimulator just doesn't create the robot until after the HALSim connection has been made and been quiet for 0.5 seconds.

@brettle
Copy link
Member Author

brettle commented Jul 17, 2024

Per this PR comment, it looks like something closer to the original proposed workaround will be required.

@brettle
Copy link
Member Author

brettle commented Jul 17, 2024

@CoolSpy3, after thinking about this some more, this issue has ramifications for just using the DeepBlueSim controller to simulate robot code via VSCode "Simulate Robot Code". If the robot code has any devices that are implemented via SimDevice, that could trigger this deadlock. Proposed workaround:

  1. Change the controller to not start the HALSimWS client until it receives an NT request to do so.
  2. In Lib199Subsystem.periodic(), send the request. That should cover our team's robots.
  3. Document the need to either add Lib199Subsystem or manually send that request.
  4. Send the request from WebotsSimulator, probably during the onRobotInited() callback.

Thoughts?

@CoolSpy3
Copy link
Member

I think that should work. You may also need to add a signal at the end of each test to disconnect from HALSim, so that the old controller doesn't cause a deadlock (because the reload request isn't sent until after startCompetition()).

@CoolSpy3
Copy link
Member

Actually, it looks like that behavior is changed in #118, so it should be fine.

@brettle
Copy link
Member Author

brettle commented Jul 20, 2024

Fixed in #120.

@brettle brettle closed this as completed Jul 20, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants