Use websockets to make on-demand worker file previews faster #4096

Merged: 39 commits into master on Dec 2, 2022

Conversation

@epicfaace (Member) commented on May 5, 2022

Use websockets to make on-demand worker file previews faster. File previews (such as loading stdout) of a worker that is currently running a bundle now take ~0.5 seconds.

Fixes #4084. This PR is basically a POC of my comment #4084 (comment) -- I've added a thin websocket layer so that when a user requests to view one of the worker's files:

  • the server pings the worker through the websocket
  • the worker then checks in
  • the worker then sends back the file (a minimal sketch of this flow is included below).

This also gives us the latitude to change the worker's default check-in frequency from 5 seconds to 20 seconds, further decreasing the load on the rest server.
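
Here is a minimal sketch of the relay described above, for context. It uses the websockets library and the /main and /worker/<id> endpoints that appear in this PR's diff; the handler signature assumes the library's legacy asyncio API, and the real module layout in this PR may differ:

import asyncio
import websockets

# worker_id -> websocket connection of the listening worker
worker_connections = {}

async def handler(websocket, path):
    if path == "/main":
        # The rest server asks us to ping a specific worker.
        worker_id = await websocket.recv()
        worker_ws = worker_connections.get(worker_id)
        if worker_ws is not None:
            await worker_ws.send("ping")
    elif path.startswith("/worker/"):
        # A worker registers here and stays connected, waiting for pings.
        worker_id = path[len("/worker/"):]
        worker_connections[worker_id] = websocket
        try:
            await websocket.wait_closed()
        finally:
            worker_connections.pop(worker_id, None)

async def main():
    async with websockets.serve(handler, "0.0.0.0", 2901):
        await asyncio.Future()  # run until cancelled

if __name__ == "__main__":
    asyncio.run(main())

On the rest-server side, serving a file preview then amounts to connecting to /main, sending the target worker's id, and letting the worker's next (now immediate) check-in return the file.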

TODOs:

  • don't hardcode the ws-server URL
  • test this on dev, with ~20 workers
  • make sure worker websocket listening thread code is thread-safe

@epicfaace changed the title from "websocket test" to "Use websockets to make on-demand worker file previews faster" on May 5, 2022
@@ -138,7 +138,7 @@ def parse_args():
'--checkin-frequency-seconds',
help='Number of seconds to wait between worker check-ins',
type=int,
default=5,
default=20,
Collaborator:

I still think it would be nice to update the worker status at least once every 5 seconds...we don't need this to be 20, right?

Member Author:

No -- we can keep it at 5

@epicfaace (Member Author) commented with an attached image.

logging.warn(
f"Got websocket message, got data: {data}, going to check in now."
)
self.checkin()
Member Author:

For concurrency:

  • hold a global lock when running checkin()
  • also hold the lock in the worker loop 1) when running checkin() and 2) when processing bundles (a rough sketch follows below)
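
A rough sketch of that locking scheme. The method names checkin and process_runs come from this PR; the loop structure, the WorkerSketch class, and the placeholder bodies are assumptions for illustration only:

import time
from threading import RLock

class WorkerSketch:
    def __init__(self, checkin_frequency_seconds=5):
        self.checkin_frequency_seconds = checkin_frequency_seconds
        self.terminate = False
        self._lock = RLock()

    def checkin(self):
        print("checking in with the rest server")  # placeholder

    def process_runs(self):
        print("processing bundle runs")  # placeholder

    def on_ws_ping(self):
        # Called from the websocket listening thread when the server pings us.
        with self._lock:
            self.checkin()

    def run_loop(self):
        # Main worker loop: periodic check-ins and run processing happen
        # under the same lock as the on-demand check-ins above.
        while not self.terminate:
            with self._lock:
                self.checkin()
                self.process_runs()
            time.sleep(self.checkin_frequency_seconds)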

print("RSH")
worker_id = await websocket.recv()
logger.warn(f"Got a message from the rest server, to ping worker: {worker_id}.")

Member Author:

document

docs/Server-Setup.md (outdated review comment, resolved)
@@ -154,6 +154,40 @@ containers, and periodically report on the status of the runs.
A worker also has to respond to various commands such as reading files in the
bundle while it's running, killing bundles, etc.

All data transfer between the worker and the server happens through a process known
Member Author:

@AndrewJGaut here's an explanation of my approach here, would love any feedback!

@AndrewJGaut marked this pull request as ready for review on November 9, 2022, 06:35
@AndrewJGaut (Contributor):

@epicfaace Can you please review the changes made since your last change? They are quite minor. Once you give the LGTM, I'll approve it and merge it

@epicfaace (Member Author) left a comment:

lgtm

self._checkin_lock = Lock()
# Lock ensures the listening thread and main thread don't simultaneously
# access the runs dictionary, which would cause race conditions.
self._lock = RLock()
Member Author:

Why use RLock instead of Lock?

Contributor:

It's just for elegance's sake.

RLock allows the lock to be acquired multiple times (provided it is also released precisely as many times as it is acquired). This is desirable in this case because, to avoid race conditions, we must call self.process_runs before the lock is released within the checkin() function. By using RLock, we allow the code to acquire the lock a second time so that we can run process_runs within checkin.

If we didn't do this, we could use a flag to tell process_runs we already acquired the lock and not acquire the lock in that case, but that seemed less elegant to me.
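
A small illustration of that point (simplified, not the actual worker code):

from threading import RLock

lock = RLock()

def process_runs():
    # Also takes the lock, so it is safe to call on its own.
    with lock:
        pass  # update the runs dictionary here

def checkin():
    with lock:
        # ... perform the check-in ...
        # Re-acquiring the same RLock from the owning thread is allowed;
        # with a plain Lock this nested acquire would deadlock.
        process_runs()

checkin()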

codalab/worker/worker.py (review comment resolved)
@@ -1724,7 +1724,7 @@ def test_run(ctx):
# Test that bundle fails when run without sufficient time quota
_run_command([cl, 'uedit', 'codalab', '--time-quota', '2'])
uuid = _run_command([cl, 'run', 'sleep 100000'])
wait_until_state(uuid, State.KILLED, timeout_seconds=60)
wait_until_state(uuid, State.KILLED, timeout_seconds=63)
Member Author:

why from 60 to 63?

def send_json_message(self, socket_id, message, timeout_secs, autoretry=True):
def _ping_worker_ws(self, worker_id):
async def ping_ws():
async with websockets.connect("ws://ws-server:2901/main") as websocket:
Member Author:

We need to ensure this isn't hardcoded.

while not self.terminate:
logging.warn(f"Connecting anew to: ws://ws-server:2901/worker/{self.id}")
async with websockets.connect(
f"ws://ws-server:2901/worker/{self.id}", max_queue=1
Member Author:

We need to ensure this isn't hardcoded.

@epicfaace (Member Author):

@AndrewJGaut my main feedback is that ws-server:2901 shouldn't be hardcoded. We should make use of the ws_port configured in codalab_service.py. Once you add that, feel free to merge!
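
One possible way to plumb that through, as a sketch only; the environment variable name CODALAB_WS_SERVER is an assumption, not something defined in this PR:

import os

def get_ws_server_url(default="ws://ws-server:2901"):
    # The default matches the currently hardcoded value; a deployment would
    # override it, e.g. derived from the ws_port set in codalab_service.py.
    return os.environ.get("CODALAB_WS_SERVER", default)

# Rest server side:
#   websockets.connect(f"{get_ws_server_url()}/main")
# Worker side:
#   websockets.connect(f"{get_ws_server_url()}/worker/{self.id}")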

@AndrewJGaut merged commit 5c59e6a into master on Dec 2, 2022
@AndrewJGaut mentioned this pull request on Dec 6, 2022
Successfully merging this pull request may close these issues: Improve STDOUT for users
3 participants