Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Add connection resume, client state tracking, robust reconnect logic, reconnecting hooks, more #1204

Draft
wants to merge 11 commits into
base: main
Choose a base branch
from

Conversation

JamesHutchison
Copy link

@JamesHutchison JamesHutchison commented Mar 5, 2024

By submitting this pull request you agree that all contributions to this project are made under the MIT license.

Issues

While in a discussion on the discord, it became apparent that ReactPy was unusable for any application where dynamic scaling is required. For example, in cases where Azure/GCP will start/stop workers for dynamic load balancing. As a result this, users will be disconnected randomly and then would have the state of the ReactPy components reset.

There's also other issues I encountered while building this. Please see my comments (will be added soon) which walk through this very large PR. The intent is not to have this PR merged in all at once, rather, it is a guide to subsequent smaller PRs that can properly add tests, etc. I do not have the time to properly land this PR, so I am asking for help.

Solution

The solution to the problem was to have the client store the state that's used for rendering components. The theory is that if the client state matches the server state, then things should work just fine.

In reality, there was some challenges to get to that. The first was that IDs of components and events were random instead of deterministic, so this needed to be changed. Now the IDs are based on their patch path.

The next concern is security. While the state isn't obvious to most users, without proper security a client could manipulate variables that an author would have assumed could only be set by the server. An example might be is_admin or something, or maybe the user ID of user viewing the page. Another issue is that server secrets might leak to the client if someone isn't careful.

For security, all data given a sha-256 hash as a signature. The server will reject a state variable if it doesn't match the hash. Users are given a personal salt, which they must provide to the server upon reconnection, and the server has a secret "pepper" that is added to make things more difficult. I wasn't satisfied with this, so I added OTP codes that will change, by default, every 4 hours. The OTP codes use site-packages directory's contents (specifically, the parent directory of where reactpy is installed) and the create times to generate a random secret that feeds into the SHA-1 function that generates them. The OTP codes (right now, in the past, and in the future) are added to the hash just to make it that much harder.

All state values have a key that is derived from the file and line number. To avoid leaking this information via brute force, the key is a sha-256 that has about half the bits cut off, making it difficult to recreate.

The client behavior has been revamped. It now will always reconnect, has improved reconnect behavior, and also will disconnect on idle activity (default is 4 minutes). When the user moves the mouse or scrolls, it will reconnect. There's also hooks for when it is reconnecting and has successfully reconnected. The default behavior is to gray out the page and display a spinning pipe character. It's not great.

The performance impact appears to be negligible.

Checklist

The checklist will be for each individual feature that should be a separate PR. TBD

Work items remaining (this PR):

  • Add serialization for datetime types such as timezone and timedelta. This logic is currently in Heavy Resume but makes sense to just give to everyone.
  • Fix the server outrunning the client by moving the message handlers out of the tsx file and into the javascript file
  • Comment on the PR on everything

Extracted Issue List

  • - A - bug fix for when server would return "layout-update" before the client was ready.
  • - B - bug fix for inconsistent script rendering that resulted in double execution of scripts. Scripts also always are returned to the DOM for user inspection rather than executed directly.
  • - C - Message updates
  • - D - Have client keep track of state variables (the big one)
  • - E - add ability to debug messages sent and received on the client
  • - F - Add idle disconnect time to client
  • - G - Add connection timeout so that if you connect to a bad node the client doesn't stay stuck to it
  • - H - Improve message types in client
  • - I - switch to orjson for faster serialization
  • - J - Bug fix for corrupted hook state - change hook state to a contextvar
  • - K - Usage of slots
  • - L - add priority queue to component rendering - not functioning as expected currently
  • - M - messages should be a class and not a typed dict
  • - N - Use of protocols (substantial performance and usability issues)
  • - O - don't use hasattr in the manner identified

* Fix component decorator eating static type hints
Add reconnection and client state side state capabilities
* Delete commented out code

* add serialization for timezone and timedelta

* don't show reconnecting layer on first attempt

* Only connect after layout update handlers are set up

* perf tweak that apparently was never saved

* deserialization as well for timezone

* timezone and timedelta

* alter z-index
@Archmonger
Copy link
Contributor

This is definitely way too much within one PR. Can you break this into smaller PRs?

@JamesHutchison
Copy link
Author

Assuming this headache of mine doesn't turn into something worse, I'll add comments tomorrow. I don't have the capacity to properly break this out, create / update tests, etc for the changes and I'm asking the community for help to land these features. This PR is intended to be a starting point and is not intended to be merged.

@Archmonger
Copy link
Contributor

Also just to re-iterate what I said over discord, I've only experienced WS disconnections under high load when ReactPy is not using a BACKHAUL_THREAD.

A potential alternative to client-side state storage is simply migrating that implementation to core.

@JamesHutchison
Copy link
Author

It's not clear to me how backhaul threads are equivalent to this. Do you have a doc that explains the architecture? My impression from your description was that it helped with stability in django but it wouldn't help if your node count was scaled down and you still had active connections on the terminated node (same with a client's internet disconnecting / reconnecting).

I think having the client manage their own state is a perfect solution. It very much simplifies the infrastructure and reduces costs. There's a slight delay due to the copy of data but since you're reconnecting there's going to be a delay anyways.

@@ -29,7 +29,7 @@ export function Layout(props: { client: ReactPyClient }): JSX.Element {

useEffect(
() =>
props.client.onMessage("layout-update", ({ path, model }) => {
props.client.onLayoutUpdate((path: string, model: any) => {
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Issue A: bug fix for when server would return "layout-update" before the client was ready.

Comment on lines +128 to +136
const scriptElement: HTMLScriptElement = document.createElement("script");
for (const [k, v] of Object.entries(model.attributes || {})) {
scriptElement.setAttribute(k, v);
}
if (scriptContent) {
scriptElement.appendChild(document.createTextNode(scriptContent));
}
}, [model.key, ref.current]);
ref.current.appendChild(scriptElement);
}, [model.key]);
Copy link
Author

@JamesHutchison JamesHutchison Mar 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Issue B: bug fix for inconsistent script rendering that resulted in double execution of scripts. Scripts also always are returned to the DOM for user inspection rather than executed directly.

The current behavior was to immediately execute a script with no properties. If it had properties, it went to the DOM instead. These different code paths likely hid the bug where scripts were double executing.

The fix for the double execution was to remove ref.current from the dependency list

Comment on lines +15 to +21
export type ReconnectingCheckMessage = {
type: "reconnecting-check";
value: string;
}

export type IncomingMessage = LayoutUpdateMessage | ReconnectingCheckMessage;
export type OutgoingMessage = LayoutEventMessage | ReconnectingCheckMessage;
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Issue C: Message updates. I think looking at it out now I didn't finish this with the most recent messages, such as "state-update"

Comment on lines +19 to +20
onLayoutUpdate(handler: (path: string, model: any) => void): void;

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Issue A

Comment on lines +34 to +39

/**
* Update state vars from the server for reconnections
* @param givenStateVars State vars to store
*/
updateStateVars(givenStateVars: object): void;
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Issue D: Have client keep track of state variables

Comment on lines +95 to +104
def get_caller_info():
# Get the current stack frame and then the frame above it
caller_frame = sys._getframe(2)
for i in range(50):
render_frame = sys._getframe(4 + i)
patch_path = render_frame.f_locals.get("patch_path_for_state")
if patch_path is not None:
break
# Extract the relevant information: file path and line number and hash it
return f"{caller_frame.f_code.co_filename} {caller_frame.f_lineno} {patch_path}"
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ugly hack - to preserve the interface, it walks up the stack frames to find patch_path_for_state in the locals. I would suggest refactoring it

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This returns the unique identifier of each state / ref/ hook call based on what called it. This is eventually hashed and then the hash is truncated to create the key. The purpose of truncate the hash is to make it more challenging to reproduce this information.

Comment on lines +113 to +114
if __debug__:
__DEBUG_CALLER_INFO_TO_STATE_KEY[result] = caller_info
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When keys are missing, or the value doesn't serialize / deserialize this variable could be used to figure out what its talking about

class _CurrentState(Generic[_Type]):
__slots__ = "value", "dispatch"
__slots__ = "key", "value", "dispatch"
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Issue K - slots are probably fine here

Comment on lines +190 to +198
hook = get_current_hook()
if hook.reconnecting.current:
if not isinstance(dependencies, ReconnectingOnly):
return
dependencies = None
else:
if isinstance(dependencies, ReconnectingOnly):
return
dependencies = _try_to_infer_closure_values(function, dependencies)
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logic for whether something is executed on reconnection. The use_memo and use_ref lines must be located where they will always get called

Comment on lines -207 to +267
logger.debug(f"{current_hook().component} {new}")
logger.debug(f"{get_current_hook().component} {new}")
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Issue J - current_hook was renamed to get_current_hook to follow naming convention

Comment on lines +79 to +84
"reconnecting",
"client_state",
"_state_recovery_serializer",
"_state_var_lock",
"_hook_state_token",
"_previous_states",
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Issue J - probably shouldn't be using slots

Comment on lines +96 to +98
# if not isinstance(root, ComponentType):
# msg = f"Expected a ComponentType, not {type(root)!r}."
# raise TypeError(msg)
Copy link
Author

@JamesHutchison JamesHutchison Mar 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Issue N - this is very expensive. On my performance trace, nearly half the render time was going to isinstance checks for protocols. I highly recommend removing protocols entirely because they hurt the developer experience since they are disjoint from the actual classes, and because they incur a performance penalty. Just use a base class.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also think these checks should be behind a if __debug__: so that we're not wasting cycles in production on something like this.

self.root = root
self.reconnecting = Ref(False)
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ref is significant. When stuff gets created, on reconnection, it just hangs out forever until its reused.

I would prefer refactoring the layout code so that its a bit cleaner and this isn't tied to the state like this. Like you just have a context var that has everything you need to know. However, the current way things are executed get in the way of doing that because stuff just ends up on the main loop in isolation.

Comment on lines 108 to +113
async def __aenter__(self) -> Layout:
return await self.start()

async def start(self) -> Layout:
self._hook_state_token = create_hook_state()

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Issue D - the refactor to start is needed so that we can restart the process from scratch if state rebuild fails.

@@ -122,7 +174,7 @@ async def deliver(self, event: LayoutEventMessage) -> None:
except Exception:
logger.exception(f"Failed to execute event handler {handler}")
else:
logger.info(
logger.warning(
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From what I can tell this shouldn't happen under normal circumstances so I upgraded this to a warning

max_num_state_objects: int = 512,
max_object_length: int = 40000,
default_serializer: Callable[[Any], bytes] | None = None,
deserializer_map: dict[type, Callable[[Any], Any]] | None = None,
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A map of object to deserialization. By default, TheObjectType(**kwargs) is tried which probably won't work for nested objects, but might.

Comment on lines +166 to +173
type_id = b"1" # bool
if obj is None:
return "0", "", ""
match obj:
case True:
result = b"true"
case False:
result = b"false"
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This extra complexity is for performance reasons.

None is not signed.
Boolean values are really common so this is a shortcut.

if isinstance(child, ComponentType) and child.key is None:
if child.key is None and isinstance(child, ComponentType):
Copy link
Author

@JamesHutchison JamesHutchison Mar 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Issue N - perf: since this is a warning condition, the cheaper check is done first.

Comment on lines +47 to +49
@property
def value(self) -> _RefValue:
return self.current
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

aligns the interface with use_state for serialization

def __init__(
self, initial_value: _RefValue = _UNDEFINED, key: str | None = None
) -> None:
from reactpy.core._life_cycle_hook import get_current_hook
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIRC this works around a circular import

@JamesHutchison
Copy link
Author

JamesHutchison commented Mar 5, 2024

My suggestion on approach:

  • Keep the javascript client updates and fixes as a single PR, even though it resolves multiple issues. It either needs to be backwards compatible or could merge into the PR that does the server side of Issue D
  • Some of these are small but big wins. I would do them first. For example, fixing the performance problem caused by the protocols isinstance checks is only a couple lines of code
  • Some are worth discussing before action. The priority queue sounds good but doesn't actually work as expected. The render times are also really small so the value of a priority queue is questionable. It might be useful in the future though.
  • Measure using a tool like py-spy or something to get performance bottlenecks rather than making any assumptions. I thought the sha256 calls would show up somewhere but they, and their parent functions, haven't.

@JamesHutchison
Copy link
Author

Example usage that supports Pydantic / SQLModel objects:

def default_serializer(obj: Any) -> Any:
    if isinstance(obj, BaseModel):
        return obj.dict()
    raise TypeError(f"Object of type {obj.__class__.__name__} is not JSON serializable")


state_recovery_manager = StateRecoveryManager(
    serializable_types=[
        TemporaryUser,
        QAMessage,
    ],
    pepper="pepper",
    default_serializer=default_serializer,
)

configure(app, Main, Options(head=head, cors=cors_options), state_recovery_manager)

…on (#6)

* Fix for client not reconnecting on activity
* Type improvements
* comment improvements
@JamesHutchison
Copy link
Author

If you want to see this in action, it is live on heavyresume.com

Inactivity timeout is set to 4 minutes. You can see it happen in the console. If you scroll down and let it time out, you'll notice that it doesn't regenerate the view on reconnect. Moving the mouse or scrolling triggers reconnection.

Thanks!

@JamesHutchison
Copy link
Author

JamesHutchison commented Mar 16, 2024

One thing I'd like to bring attention to - if the layout logic is rewritten, please leave some means to re-render all components of a certain type. Next best thing is some means to trigger a whole page re-render from the client state. The use case would be hot reloading. Right now I can successfully use jurigged to hot reload - however unless the component has a toggle or something to force a re-render, I have to reload the page.

edit: this is no longer true with my PR below. That PR will disconnect and reconnect and force a rerender.

@JamesHutchison
Copy link
Author

JamesHutchison commented Mar 20, 2024

This PR (on my fork) leverages the logic from this PR to add hot reloading. I will not be merging it into my main branch because I don't want to clutter things up further, and I'm fine just checking out that one branch in the mean time.

JamesHutchison#12

@JamesHutchison
Copy link
Author

JamesHutchison commented May 3, 2024

This linked PR fixes a bug where use_effect(my_func, []) was rerunning after the next reconnection. The fix from that PR applies to this one.

JamesHutchison#16

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants