EventLoop: direct epoll/kqueue integration #14959

@ysbaddaden ysbaddaden commented Sep 3, 2024

Integrates the epoll (Linux) and kqueue (*BSD, macOS) syscalls to handle the event loop on UNIX platforms.

Benefits

  • Can remove the libevent external dependency (still available as a fallback, see below);
  • Better performance thanks to a design change;

Instead of adding a fd to the poll structure when the fd blocks and removing it when it's ready for read or write (then repeating), we now add it once and keep it there until we close the fd. This is the ideal scenario for epoll and kqueue.
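To illustrate the "register once" idea outside of Crystal, here is a minimal Python sketch (Linux only, using the standard `select.epoll` wrapper). It assumes edge-triggered mode (`EPOLLET`), which the description above does not state explicitly, and the function name is made up for the example. The fd is added to epoll a single time and is only drained on each readiness notification; it is never removed and re-added between reads.

```python
import select
import socket


def read_with_single_registration(messages):
    """Send each message over a socket pair and read it back, registering
    the reading fd in epoll exactly once (hypothetical helper)."""
    reader, writer = socket.socketpair()
    reader.setblocking(False)
    epoll = select.epoll()
    registrations = 0
    received = []
    try:
        # Register once, edge-triggered, for the whole lifetime of the fd.
        epoll.register(reader.fileno(), select.EPOLLIN | select.EPOLLET)
        registrations += 1

        for message in messages:
            writer.send(message)
            # Wait for a readiness edge; the fd is not re-registered.
            for fd, mask in epoll.poll(1.0):
                if fd != reader.fileno() or not mask & select.EPOLLIN:
                    continue
                # Edge-triggered: drain until the read would block.
                while True:
                    try:
                        chunk = reader.recv(4096)
                    except BlockingIOError:
                        break
                    received.append(chunk)
    finally:
        # Closing the fd is what finally removes it from the epoll set.
        reader.close()
        writer.close()
        epoll.close()
    return registrations, received
```

With level-triggered registration the fd would instead keep reporting readiness until drained, which is why the add/remove dance was needed in the libevent-style design.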

The previous attempts to integrate epoll & kqueue directly followed libevent's logic: they didn't bring any performance improvement and required a big lock (contended with MT) to keep a list of events. This change instead allows up to a +20% performance boost in an ideal scenario (http/server with long-lived connections), and only requires fine-grained locks for MT (usually uncontended).

To nobody's surprise: this is how Go's netpoll works.

Notes

The evloop supports preview_mt, still with one evloop instance per thread (scheduler requirement). Execution contexts (RFC #2) will have one evloop instance per context.

We transfer the fd from one evloop to another when it would block, and the receiving evloop becomes the sole "owner" of the fd. The transfer is automatic; there is nothing to do. This leads to a caveat: we can't have multiple fibers waiting for the same fd in different evloops (aka threads). Trying to transfer the fd will raise if any fiber is already waiting. This is because an IO read/write can have a timeout, which is registered in the current evloop's timers, and timers aren't transferred. This also allows for future enhancements (e.g. evloop enqueues are always local).
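The ownership rule can be modelled in a few lines. The following Python sketch is only an illustration of the rule as described here, not the actual Crystal implementation: the class and method names are hypothetical, and event loops and fibers are represented by plain strings.

```python
class PollDescriptor:
    """Hypothetical per-fd state: the owning event loop and waiting fibers."""

    def __init__(self):
        self.owner = None
        self.readers = []
        self.writers = []

    def take_ownership(self, evloop):
        """Transfer the fd to `evloop`, unless fibers already wait elsewhere."""
        if self.owner == evloop:
            return
        if self.readers or self.writers:
            # Waiters may have timeouts registered in the current owner's
            # timers, and timers are not transferred: refuse the transfer.
            raise RuntimeError("fd is already waited on in another event loop")
        self.owner = evloop

    def wait_readable(self, evloop, fiber):
        """Called when a read would block: own the fd, then enqueue the fiber."""
        self.take_ownership(evloop)
        self.readers.append(fiber)

    def resume_reader(self):
        """Dequeue the next waiting reader, if any."""
        return self.readers.pop(0) if self.readers else None
```

Several fibers may wait on the same fd as long as they all do so from the owning evloop; the transfer only succeeds once the waiting lists are empty.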

This can be an issue for preview_mt, for example with multiple fibers waiting for connections on a server socket. This shall be mitigated with execution contexts from RFC #2, which will share an evloop instance per context: just don't share a fd across multiple contexts.

If you experience any issue, you can always recompile with the -Devloop_libevent compile-time flag to return to the regular libevent-based event loop instead of the shiny new one.

Review

The branch kept the whole history of commits from the previous epoll and kqueue branches, and has far too many commits. Maybe a couple of them could be extracted on their own.

Each syscall is abstracted in its own little struct: Crystal::System::Epoll, Crystal::System::TimerFD, etc. They could be simplified (there is possibly some dead code).

  • Crystal::Evented namespace (src/crystal/system/unix/evented) contains the base implementation that the system specific Crystal::Epoll::EventLoop (src/crystal/system/unix/epoll) and Crystal::Kqueue::EventLoop (src/crystal/system/unix/kqueue) are built on.

  • Crystal::Evented::Timers is a basic data structure to keep a list of timers (one instance per evloop); it could be optimized (in follow-up pull requests)

  • Crystal::Evented::Event holds the event, be it IO or sleep or select timeout or IO with timeout, while FiberEvent wraps an Event for sleeps and select timeouts.

  • Crystal::Evented::PollDescriptor instances are allocated in a Generational Arena and keep the list of readers and writers (events/fibers waiting on IO).

The run loop first waits on epoll/kqueue, canceling IO timeouts as it resumes fibers, then proceeds to process timers.

The epoll/kqueue call doesn't wait until the next ready timer (it could without MT and with preview_mt but can't for execution contexts). I instead rely on timerfd on Linux and EVFILT_TIMER on BSD to interrupt a blocking evloop wait. This also circumvents the 1ms precision of epoll_wait on Linux.
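Since the blocking wait is interrupted by a kernel timer rather than by a wait timeout, the timers list has to tell the event loop when the earliest deadline changes, so that the kernel timer can be re-armed. A minimal Python sketch of such a list follows. It is an assumption about the shape of the data structure, not a port of Crystal::Evented::Timers; the `add` and `delete` return values (whether the next ready timer changed) are hypothetical.

```python
import bisect
import itertools


class Timers:
    """Hypothetical sorted timers list, one instance per event loop."""

    def __init__(self):
        self._entries = []  # (wake_at, sequence, event), sorted by wake_at
        self._sequence = itertools.count()  # tie-breaker for equal wake_at

    def next_ready(self):
        """Return the earliest wake_at, or None when there are no timers."""
        return self._entries[0][0] if self._entries else None

    def add(self, wake_at, event):
        """Insert a timer; return True if it became the next ready timer
        (the caller would then re-arm timerfd / EVFILT_TIMER)."""
        entry = (wake_at, next(self._sequence), event)
        bisect.insort(self._entries, entry)
        return self._entries[0] is entry

    def delete(self, event):
        """Remove a timer; return True if it was the next ready timer."""
        for position, entry in enumerate(self._entries):
            if entry[2] == event:
                del self._entries[position]
                return position == 0
        return False

    def dequeue_ready(self, now):
        """Pop and return every event whose wake_at is not after `now`."""
        ready = []
        while self._entries and self._entries[0][0] <= now:
            ready.append(self._entries.pop(0)[2])
        return ready
```

A sorted list keeps the sketch short; a heap or a timing wheel would be the kind of optimization left for follow-up pull requests.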

References

Supersedes both #14814 and #14829.

We can't call EPOLL_CTL_MOD with EPOLLEXCLUSIVE. Let's disable it for
now and see later if we can replace it with a pair of EPOLL_CTL_DEL and
EPOLL_CTL_ADD.

Process.run sometimes hangs forever after fork and before exec, because
it tries to close a fd, which requires taking a lock that another thread
may have already acquired. Since `fork` only duplicates the current
thread (the other ones are not), the forked process was left waiting
for a mutex to be unlocked, which would never happen.

That required allocating a Node for the interrupt event, which ain't a
bad idea.

Extracts the generic parts of the event loop into an intermediary class
between Crystal::EventLoop and Crystal::Epoll::EventLoop so we can reuse
it to implement the event loop on other similar syscalls (poll, kqueue).

Sometimes we only want a pair of fds, and not IO::FileDescriptor
objects.

For some reason specs fail with a fiber failing to raise an exception
because `pthread_mutex_unlock` failed with EPERM while trying to dequeue
the `Fiber#resume_event` from the event loop.

Re-creating the thread mutex after fork seems to fix the issue.

This allows keeping a file descriptor in the evloop for its whole
lifetime (from open to close) instead of adding it every time it would
block and removing it as soon as it unblocks. This brings over 20%
performance improvement on a simple HTTP/1.1 server (with keepalive).

Among the advantages: this allows removing the global mutex around
handling IO events and instead having an almost never contended lock
around the reader or the writer waiting lists for each IO. We don't even
have to keep a global list of events (epoll and kqueue will do it).

The drawback is that the preview MT scheduler isn't compatible with this
scheme.
@ysbaddaden ysbaddaden self-assigned this Sep 3, 2024

This should be renamed to Crystal::Evented::Arena since it's not a generic generational arena (memory region). It takes advantage of the fact that the OS kernel handles the fd number (it's guaranteed unique) and always reuses closed fds instead of growing (until it's needed).

An actual generational arena would keep a list of free indexes.

Note: the goal of the arena is to:

  • avoid repeated allocations;
  • avoid polluting the IO object with the PollDescriptor (doesn't exist in other evloops);
  • avoid saving raw pointers into kernel data structures;
  • safely detect allocation issues instead of segfaults because of raw pointers.
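The goals above can be sketched with a small fd-indexed arena. This Python version is a hypothetical model, not the Crystal code: the class name, the fixed capacity, and the packing of slot and generation into one 64-bit integer are assumptions made for the example. A packed index like this fits in an integer payload (for instance the `u64` member of `epoll_event.data`), so no raw pointer has to be stored in the kernel, and a stale index is detected instead of dereferenced.

```python
class FdArena:
    """Hypothetical fd-indexed arena with a generation counter per slot."""

    GENERATION_BITS = 32
    GENERATION_MASK = (1 << GENERATION_BITS) - 1

    def __init__(self, capacity):
        # One preallocated slot per possible fd: no free list is needed
        # because the kernel hands out (and reuses) the fd numbers.
        self._values = [None] * capacity
        self._generations = [0] * capacity

    def allocate(self, fd, value):
        """Store `value` in slot `fd`; return the packed (fd, generation)."""
        if self._values[fd] is not None:
            raise RuntimeError(f"slot {fd} is already allocated")
        generation = (self._generations[fd] + 1) & self.GENERATION_MASK
        self._generations[fd] = generation
        self._values[fd] = value
        return (fd << self.GENERATION_BITS) | generation

    def get(self, index):
        """Return the value for a packed index, raising if it is stale."""
        fd = index >> self.GENERATION_BITS
        generation = index & self.GENERATION_MASK
        if self._values[fd] is None or self._generations[fd] != generation:
            raise RuntimeError("stale or invalid arena index")
        return self._values[fd]

    def free(self, fd):
        """Release slot `fd`; previously returned indexes become stale."""
        self._values[fd] = None
```

When the kernel reuses a closed fd number, the slot is allocated again with a new generation, so an event that still carries the old index fails loudly in `get`.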


ysbaddaden commented Sep 3, 2024

A couple of issues with preview_mt:

  1. Unhandled exception: Crystal::Evented::Event#wake_at cannot be nil (NilAssertionError) while trying to get the next ready timer after deleting a timer (to update timerfd). A queued event timer got its #wake_at property set to nil. Sounds like a thread safety issue with FiberChannel. Happened on my local machine.

  2. Unhandled exception: timerfd_settime: Invalid argument (RuntimeError) while trying to update timerfd after processing timers. Sounds like the itimerspec structure has an invalid value (as per the timerfd_settime man page). Happened on CI.

@ysbaddaden

More details about issue 2. The time is indeed completely wrong, so the event pointer may point to an invalid value (or invalid memory?)

Unhandled exception: timerfd_settime
time=-1623225768.01:54:32.415596704
itimerspec=LibC::Itimerspec(
  @it_interval=LibC::Timespec(@tv_sec=0, @tv_nsec=0),
  @it_value=LibC::Timespec(@tv_sec=140246706362072, @tv_nsec=-415596704)
): Invalid argument (RuntimeError)
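For reference, the timerfd_settime man page documents EINVAL when a `tv_nsec` field falls outside the range 0 to 999,999,999, which is the case for the negative `tv_nsec` in the dump above. A small Python sketch of that check, and of a conversion that always yields an in-range pair, could look like this. The helper names are hypothetical, and rejecting negative durations in the conversion is an assumption of the sketch.

```python
NANOSECONDS_PER_SECOND = 1_000_000_000


def is_valid_timespec(tv_sec, tv_nsec):
    """Check the documented timerfd_settime rule on tv_nsec.

    tv_sec is accepted as-is: only the tv_nsec range is checked here.
    """
    return 0 <= tv_nsec < NANOSECONDS_PER_SECOND


def to_timespec(total_nanoseconds):
    """Split a non-negative duration into (tv_sec, tv_nsec)."""
    if total_nanoseconds < 0:
        raise ValueError("duration must not be negative")
    return divmod(total_nanoseconds, NANOSECONDS_PER_SECOND)
```

Applied to the values logged above, `is_valid_timespec(140246706362072, -415596704)` is false, which matches the "Invalid argument" error.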
