implement backoff for reconnect #112

Open
wants to merge 4 commits into master

Conversation

@BrendanBall commented Jun 10, 2022

Related to #111.

The current reconnect feature doesn't support backoff, so it's not
really useful since spamming reconnects is not good behaviour.
This introduces a backoff feature, taking inspiration from
https://hex.pm/packages/connection.

This was meant to be a small change; however, with the existing async and handle_initial_conn_failure options, the solution turned out to be incoherent and also uncovered existing bugs with reconnect during init that could indefinitely block supervisor startup.

As a result, this change makes breaking changes to the API and introduces some new APIs to recover some previous functionality. In particular, async init is now the default and only behaviour, i.e. a network connection is only made after the special process init has finished.

Additionally, I've removed send_frame from the selective receive during the connecting status, since a connection isn't expected to stay in this status for long, and this removes the need for an extra API to wait for the connection to be established.
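To illustrate what I mean by selective receive, here's a generic sketch (not WebSockex's actual internals; the message shapes are assumptions):

```elixir
# Generic sketch only; the message shapes are assumptions, not
# WebSockex's real internal protocol. Because there is no clause for
# {:send_frame, ...}, those requests stay queued in the mailbox until
# websocket_loop begins matching them.
defp connecting_loop(state) do
  receive do
    {:connected, conn} -> websocket_loop(%{state | conn: conn})
  end
end
```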

You can now do the following:

```elixir
def handle_disconnect(%{attempt: attempt}, state) do
  {:backoff, calculate_backoff(attempt), state}
end
```
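Here `calculate_backoff/1` is the user's own function, not part of the library. A minimal sketch, assuming `attempt` starts at 1:

```elixir
# Hypothetical user-side helper: capped exponential backoff with full
# jitter, returning a timeout in milliseconds.
defp calculate_backoff(attempt) do
  base = 1_000
  cap = 30_000
  ceiling = min(cap, base * Integer.pow(2, attempt - 1))
  :rand.uniform(ceiling)
end
```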

I realised that handle_cast and handle_info messages are currently only handled in the websocket_loop, i.e. when connected, because both can return {:reply, frame, state}, which would fail if the connection weren't established at that point.
This is probably fine today, since open_loop and close_loop aren't expected to last very long, so it's acceptable for a message to wait in the inbox until the connection is established.
However, since I introduced a backoff_loop, this becomes a problem, because either:

  1. handle_cast and handle_info messages must be handled in backoff_loop, and :reply has to somehow be handled differently when not connected, or
  2. handle_cast and handle_info messages must NOT be handled in backoff_loop, which means messages could wait in the inbox for a long time, especially since the backoff is chosen by the library user and could conceivably exceed 10s if exponential backoff is implemented (this flexibility exists precisely so that users can choose a backoff strategy that fits their use case).

Using {:reply, frame, state} already makes sending messages unreliable, in the sense that the user gets no feedback on whether the message was actually sent, despite WebSocket itself being a reliable transport. This means messages could unexpectedly get lost, although currently, if :reply results in any error, the connection is closed, and in practice this probably happens mostly because of connection errors.
I'm not sure how many people think about this when choosing between send_frame and :reply, but it is suggested to use :reply instead of send_frame (#34), which I don't think is a good thing if you want reliability.

I've added an additional callback, handle_send_result, which is called on the success or error of processing a :reply action. This allows sending a message via :reply to be just as reliable as send_frame, since the user is given feedback on whether the message was sent. Note that I've not named it handle_reply_result, because I think the :reply action should be renamed: it only makes sense in handle_frame, not in handle_cast and handle_info.

The current way of sending messages via the `handle_cast` and `handle_info` callbacks doesn't let the user know whether the message was sent successfully. This is particularly important now that a connection might be in a backoff state for a long time.

This adds a new callback, `handle_send_result`, which gets called on success or error of processing a message to be sent. Note that this doesn't apply to `send_frame`, since that is synchronous and already provides feedback. It isn't named `handle_reply_result` because I believe `:reply` should be renamed: it only makes sense in `handle_frame`, not in `handle_cast` and `handle_info`.

This also allows passing through a key to correlate a result with a send request.
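For illustration, a client using this might look like the sketch below. The exact shape of the `:reply` tuple that carries the key isn't shown in this excerpt, so the 4-tuple and the `pending` map are assumptions:

```elixir
# Illustrative only: tag each outgoing frame with a key so the async
# result can be correlated back to the original send request.
def handle_cast({:send, key, msg}, state) do
  # assumed 4-tuple shape carrying the correlation key
  {:reply, {:text, msg}, key, %{state | pending: Map.put(state.pending, key, msg)}}
end

def handle_send_result(:ok, _frame, key, state) do
  # the frame was handed to the transport successfully
  {:ok, %{state | pending: Map.delete(state.pending, key)}}
end

def handle_send_result({:error, _reason}, _frame, key, state) do
  # e.g. sent during backoff: a real client could NACK or requeue the
  # pending message here instead of just dropping it
  {:ok, %{state | pending: Map.delete(state.pending, key)}}
end
```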
BrendanBall pushed a commit to BrendanBall/stargate that referenced this pull request Jun 22, 2022
Currently Stargate prevents the application from starting up, or causes it to crash, if Pulsar is or becomes unavailable. This is a result of how WebSockex is implemented. This builds on a refactor of WebSockex at Azolo/websockex#112, which implements connection management asynchronously to process startup. This makes Stargate more production-ready, since it will allow an application to degrade gracefully when Pulsar is temporarily unavailable.

Main changes as a result of this:
1. A reconnect backoff feature is added that can be customized by the user.
2. Since Stargate will continue running even when not connected to Pulsar, the user needs to know that messages aren't being produced successfully. A breaking change was made to the async ACK MFA: the first argument is now the result of the produce, which may be either `:ok` or `{:error, reason}`.
## Next
### Enhancements
- Added `backoff` feature for reconnecting
- `send_frame/2` is no longer processed during `opening`, i.e. it is queued until `connected` or until the connection attempt fails.
Owner:

This is actually a breaking change, and potentially a large one in my opinion.

I'm torn on this. I understand that it could make the library more convenient, but it could also cause a message to be sent prematurely before some kind of required connection setup was finished.

Adding another method or keyword option would be my preference, but that means more stuff to maintain. It could be useful though. I remain somewhat on the fence.

Author:

> I'm torn on this. I understand that it could make the library more convenient, but it could also cause a message to be sent prematurely before some kind of required connection setup was finished.

Can you elaborate on this? Maybe I'm missing something, but I'm not sure when a message could be sent prematurely because of this.

Initially I had failing tests, since they currently assume that when the process is started, the connection is established and able to send messages. Before I realised I could make this change with a selective receive, so that the send_frame message waits in the inbox until the connection is connected, I added a feature, WebSockex.await_status(pid, :connected), so that you could do the following:

```elixir
{:ok, pid} = WebSockex.start_link(url)
WebSockex.await_status(pid, :connected)
WebSockex.send_frame(pid, message)
```

However, I removed this and instead relied on selective receive, because it means less work for the user and one less change in this PR.

Owner:

handle_connect is used to queue up initial messages. By casting to itself, a process can queue up messages to send in a specific order after startup.

By handling send_frame after the connection is established, we actually put ourselves at much higher risk for a race condition. A frame could be sent before the initial connection messages were sent.
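Something like this sketch (the message shapes and the startup_messages field are assumed):

```elixir
# On connect, cast to ourselves once per startup message; the casts are
# queued in order in the process's own mailbox, ahead of send_frame
# calls made by other processes after startup completes.
def handle_connect(_conn, state) do
  Enum.each(state.startup_messages, fn msg ->
    WebSockex.cast(self(), {:send, msg})
  end)

  {:ok, state}
end

def handle_cast({:send, msg}, state) do
  {:reply, {:text, msg}, state}
end
```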

Author:

OK, I have two thoughts on this.

  1. I'm just exploring how this could work better, ignoring breaking changes. Maybe first read my comment on handle_send_result. If, instead of send_frame, we had handle_send_result and handle_call (which would allow a user to implement their own send_frame), plus init, then the user could create a message queue in their state during init for startup messages; handle_call could add messages to the queue as well, and the queue could be processed in the right order starting from handle_connect. I also think that adding handle_continue support would be useful here. In fact, if we just add handle_continue, it can be used from handle_connect to process initial messages before send_frame messages by calling :continue in handle_connect.

  2. Without making breaking changes, we need a way to wait for the connection to be established before calling send_frame. This can be done either by adding await_status as above, or by adding handle_call, which has access to the current status. The user can theoretically keep track of the current status with handle_connect and handle_disconnect in their own state and implement handle_call on top of handle_cast (see the sketch below), but that's quite a lot of extra work for each user for something that's already a solved problem in a plain GenServer. On the one hand, I kind of like adding await_status natively because it requires no extra work from the user; on the other hand, I'm really seeing that everything could be handled much better, and with more flexibility, if WebSockex were on par with GenServer, namely by having init, handle_call, and handle_continue.
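For illustration, the user-side status tracking from point 2 might look like this (the connected? flag and message shapes are assumptions):

```elixir
# Track connection status in the client's own state so handle_cast can
# avoid a doomed :reply. A disconnect can still slip in between the
# check and the actual send, so this is best-effort only.
def handle_connect(_conn, state), do: {:ok, %{state | connected?: true}}

def handle_disconnect(_status_map, state), do: {:reconnect, %{state | connected?: false}}

def handle_cast({:send, msg}, %{connected?: true} = state), do: {:reply, {:text, msg}, state}

def handle_cast({:send, _msg}, state) do
  # not connected: buffer, drop, or NACK instead of replying
  {:ok, state}
end
```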

I was trying to solve all the issues I ran into with adding backoff without adding a lot of other features, but it seems that we'd be better served by first adding the other features that we can then more easily build on here.

Comment on lines +7 to +9
- Connection is always established asynchronously to process init
- `async` option is removed since it's now redundant
- `handle_initial_conn_failure` option is removed since it's no longer applicable
Owner:

This actually ruins one of the ways I'm using WebSockex myself. 😢

In my use case, there are a couple of other WebSockex processes that depend on the connection being open before they start; when things disconnect, the supervisor uses a :rest_for_one policy to reconnect.

Originally, reconnecting was actually more of a feature request that I challenged myself to do. I rarely used it myself. 😣

Author:

Hmm, the reason I removed sync connect is that, as far as I know, it's bad practice to handle a potentially long task in the init stage of a special process; in a GenServer, for example, this would normally be done by calling :continue in the init callback. The reason is that it can block the entire supervision tree's startup until init returns.
In the case of WebSockex, this can happen if:

  • the server you're connecting to is blocked by a firewall that drops all packets, so the connect hangs until either the TCP timeout or an application timeout (I think WebSockex has either a 5s or 30s timeout), or
  • you combine sync connect with infinite reconnect, which means init never returns.

You can also observe this by making a simple WebSockex client with sync connect and infinite reconnect and just doing a WebSockex.start inside an iex session. The entire iex session locks up, to the point that you can't do anything.
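A minimal repro sketch, assuming the pre-PR behaviour where handle_initial_conn_failure: true allows reconnect attempts during init:

```elixir
defmodule LockupClient do
  use WebSockex

  def start(url) do
    # synchronous connect plus reconnect-on-failure during init
    WebSockex.start(url, __MODULE__, %{}, handle_initial_conn_failure: true)
  end

  # infinite, immediate reconnect: against an unreachable host,
  # init never returns and the calling process blocks forever
  def handle_disconnect(_status_map, state), do: {:reconnect, state}
end

# In iex this call never returns, freezing the shell:
# LockupClient.start("ws://unreachable.example/ws")
```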

I prefer not to give users (of this library) footguns and unexpected behaviour, which is why I decided to remove the ability to do this.

I can see that the biggest issue is actually the combination of sync connect and reconnect, and that these two features are fundamentally at odds with each other.

Owner:

> Hmm, the reason I removed sync connect is that, as far as I know, it's bad practice to handle a potentially long task in the init stage of a special process

This can definitely be true in the Application supervisor, but sometimes even that is necessary. I don't think you can consider it bad practice, but it can be annoying to debug. On the other hand, it can also fit really well into the failure-as-a-feature mentality.

But there are people who want "footguns" simply because they fit into their application flow more easily. That being said, handle_initial_conn_failure is a long and slightly convoluted option that is disabled by default for that reason. It was a requested feature, by multiple people IIRC.

But my main argument will be that synchronous startup is a feature. I don't expect :gen_tcp to start up and then give me an error that it can't connect after the incredibly long TCP timeout, when I've already done other things on the assumption that I have a valid connection. I wouldn't expect WebSockex to start up without having a WebSocket connection established.

Comment on lines -1106 to +1116
```diff
-      {:reconnect, new_state} ->
-        {:reconnect, state.conn, new_state}
+      {:backoff, timeout, new_conn, new_state} ->
+        {:backoff, timeout, new_conn, new_state}

-      {:reconnect, new_conn, new_state} ->
-        {:reconnect, new_conn, new_state}
+      {:backoff, timeout, new_state} ->
+        {:backoff, timeout, state.conn, new_state}
```
Owner:

I actually don't understand why we would remove :reconnect instead of just adding :backoff.

Author:

TBH it doesn't necessarily even have to be called :backoff. I don't think that it would really be confusing to just have:

```elixir
{:reconnect, timeout, state}
```

Would you prefer to keep the reconnect word rather than backoff?

What I don't want, though, is to keep the existing behaviour of reconnect, which basically does a `while true: reconnect`.
IMO this is one of those footguns that shouldn't exist. It results in spamming the server with reconnects, and I've found that if a library enables bad behaviour by default, then this behaviour will end up occurring in the wild and make some poor devs/SREs very unhappy.

Of course a user could opt into this bad behaviour by doing

```elixir
{:reconnect, 0, state}
```

But at least that's an explicit choice, and when making an explicit choice, a user might be more inclined to specify a timeout of 1s or so.

If we want to keep the existing no-timeout result, {:reconnect, state}, then IMO we should change its behaviour to add a default backoff of at least 1s or so. Of course, this could also be considered a breaking change, and I would personally prefer the explicit breaking change of requiring a timeout, so that users opt into the change of behaviour by making the small refactor from {:reconnect, state} to {:reconnect, timeout, state}.

Owner:

I can buy that.

I would say keep :reconnect and default {:reconnect, state} to {:reconnect, default, state}. That way the breaking change is the handling of messages.
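That could be a small normalization step inside the library, e.g. (a sketch; the names and default value are assumed):

```elixir
# Sketch: rewrite the bare tuple so existing {:reconnect, state}
# callbacks keep working with an implicit default backoff.
@default_backoff 1_000

defp normalize_reconnect({:reconnect, state}), do: {:reconnect, @default_backoff, state}
defp normalize_reconnect(other), do: other
```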

I also think that we may need to nest receive blocks to catch the reconnect message as a priority.

Comment on lines +1132 to +1157
```elixir
defp handle_send_result(result, frame, key, state) do
  result = try_callback(state.module, :handle_send_result, [result, frame, key, state.module_state])

  case result do
    {:ok, new_state} ->
      {:ok, new_state}

    {:close, new_state} ->
      {:close, new_state}

    {:close, close_frame, new_state} ->
      {:close, close_frame, new_state}

    {:"$EXIT", _} = res ->
      res

    badreply ->
      {:"$EXIT",
       %WebSockex.BadResponseError{
         module: state.module,
         function: :handle_send_result,
         args: [result, frame, state.module_state],
         response: badreply
       }}
  end
end
```
Owner:

I understand the desire for this, but I think it's actually not a great idea.

TL;DR: Just because a message is sent doesn't mean that it was received. Rebuilding messages that need to be sent should depend on what the server has processed.

The reason is that the WebSocket protocol doesn't actually have an acknowledgement mechanism built into it. Most people build those mechanisms around the protocol: erroring out when processing a message out of order, etc. We have a guarantee that our message was sent because of TCP, but without an acknowledgement we have no idea if our message was processed.

Some servers do this with custom ping messages wrapped in data frames, some servers require unsolicited pong messages with data payloads, some servers only use WebSockets for unidirectional data. There's a whole world out there of ways of handling messages that were sent.

But because the underlying mechanism is TCP, you could send 5 additional frames before the server responds that it is unable to process a frame. We may be able to say that this frame "failed to send", but the current frame wasn't the problem. Likewise, your last frame could have succeeded in getting to the proxy, but it never made it to the server.

So, I think that offering this kind of functionality would actually lead to bad practices in the design of the connection itself. Deciding on what needs to be sent or resent on a connection should be determined after getting some state from the server.

Author:

Yeah, I understand there are many network failures that could occur. I actually added this because adding a backoff state changes the behaviour quite a bit. For example, if exponential backoff is implemented, it's not unrealistic that someone could set a max backoff of 10s, depending on their application/use case. But IMO this still applies even for a backoff of 1s.

On the one hand, we don't want to prevent handle_cast messages from being processed while in the backoff state (i.e. excluding handle_cast messages from the selective receive in the backoff state).
On the other hand, doing a :reply while in the backoff state results in a 100% chance of failure, which can be acted upon immediately, e.g. by not sending more messages until connected again, or, in my use case, by NACKing the message immediately (see jeffgrunewald/stargate#26). Holding off on sends until reconnected can of course be implemented via the handle_connect and handle_disconnect callbacks, but there's still the chance of messages getting lost because of race conditions, and those can then only be NACKed on a timeout.

So yes, this isn't a replacement for application ACKs, but rather a way to give fast feedback on messages that are guaranteed NOT to be delivered during a backoff state.

Owner:

You're not wrong, but after reading your explanation I feel the deficiency in the WebSockex API may be that we need to have some kind of connection state passed to handle_cast and handle_info.

What are your thoughts on that?

Author:

I thought about this and decided against it, because that solution still allows for race conditions.
If in handle_cast you check that the connection is connected and then do a reply, and the connection drops before the reply is processed, then you still lose a message. Having a handle_send_result callback puts reply in the same position as send_frame, just with async handling: send_frame gives you feedback on whether the TCP packet was sent or the connection wasn't established, e.g. by returning a NotConnected error.
IMO having these two APIs be equally powerful in terms of the guarantees/feedback they provide is good.

The other thing is that people have also asked for the same features as GenServer. If WebSockex adds handle_call support, then together with handle_send_result a user is able to implement send_frame without it being a native API.

In general I feel that async APIs should be more powerful, not less powerful, than sync APIs; in any case, in Elixir, sync APIs are always an abstraction over async APIs.
So it just feels wrong that reply gives weaker feedback and reliability than send_frame.

@Azolo (Owner) commented Jul 19, 2022

I think we have some disagreements on the underlying reliability of the WebSocket protocol. I tried to explain my point of view.

I also feel like you are gutting some important functionality for my personal use case, so I tried to point that out.

On the other hand, I understand the desire to want to get rid of some of the complexity that results from some of the features implemented for ease of use.

I think that, unfortunately, there aren't many changes you can make without removing behavior that someone like me depends upon. The other problem is that WebSocket servers are so diverse in their implementations that we can't actually assume that the protocol is implemented 100% correctly.

I do really appreciate this though, and hope we can continue having a conversation on how to get this functionality into WebSockex.

@BrendanBall (Author):

I wasn't planning on making so many changes when I set out to add backoff support.
Some of the changes I made were for one of the following reasons:

  1. I needed it to get tests passing
  2. Keeping both features added a lot of complexity, which seemed unnecessary to me, especially since it meant more things to test
  3. Some features (e.g. reconnect) I considered broken as-is and requiring small changes to upgrade to the new version (and since this library isn't at v1 yet, I see this as acceptable)

However, I do prefer small PRs, so if we can agree on a subset of the changes, I can try to split this up into smaller PRs.
I think the main issue was that adding backoff had cascading effects on tests and on certain assumptions/implications, which prevented me from isolating it as a small PR.

Let's see what we can agree on.
