Added a simple inference webserver #353
base: main
Conversation
Is it possible to allow submitting inference requests during training? Just like running a manual eval.
Could always run a separate server while training.
The idea is to integrate it fully into training. The server accepts requests and puts them into a queue; the model being trained is registered with a callback that dequeues each request and runs inference on it, just as evals run during training. That way, we can reuse the model in training to do live inference at will. I admit that this might be a separate PR.
I agree, that's a separate PR. :) This is a very simple lightweight server, just a modification of the existing inference code to get instructions from HTTP requests rather than the terminal, and return them to the requesting client rather than the screen. If you want to take the time to configure it to run training and inference through a queue, go ahead :) I implemented it because, at present, to run test inferences you either have to do it completely manually (typing / pasting into a terminal for each one), or pay the long overhead of waiting for the inference server to start up for every single test. By starting up an HTTP server and accepting requests on it, you can automatically run many different tests without a ton of overhead. Or use it for non-testing / production purposes, for that matter.
Yes, currently it's not able to reuse the loaded model to run inference on a batch of inputs, which is very inconvenient.
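The queue-based integration discussed above could be sketched roughly like this. Everything here is hypothetical, not part of the axolotl codebase: `InferenceQueueCallback`, `submit`, `on_step_end`, and `generate_fn` are illustrative names, and the lambda stands in for real model inference.

```python
import queue

class InferenceQueueCallback:
    """Drains pending HTTP requests between training steps (sketch)."""

    def __init__(self, generate_fn):
        self.requests = queue.Queue()
        self.generate_fn = generate_fn

    def submit(self, instruction):
        # Called from the HTTP handler thread; returns a one-slot queue
        # the handler can block on for the result.
        result = queue.Queue(maxsize=1)
        self.requests.put((instruction, result))
        return result

    def on_step_end(self):
        # Called from the training loop; reuses the in-training model
        # (stood in for here by generate_fn).
        while not self.requests.empty():
            instruction, result = self.requests.get()
            result.put(self.generate_fn(instruction))

# Demo with a trivial stand-in for real model inference.
cb = InferenceQueueCallback(generate_fn=lambda text: text.upper())
pending = cb.submit("hello")
cb.on_step_end()
print(pending.get())  # prints "HELLO"
```

The handler blocks on the per-request result queue, so HTTP clients simply wait until the next training step services them.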
```python
if not instruction:
    response = ""
else:
    default_tokens = {"unk_token": "<unk>", "bos_token": "<s>", "eos_token": "</s>"}
```
we could probably grab these from the tokenizer, as any special tokens defined in the config are added to the tokenizer when it is instantiated.
Patches welcome. :) Again, this is just redoing a copy of the preexisting inference section as a webserver. I didn't change the logic.
```python
server_address = (cfg.server_addr, cfg.server_port)
httpd = socketserver.TCPServer(
    server_address,
    lambda *args, **kwargs: HttpHandler(
        *args, cfg=cfg, prompter=prompter, tokenizer=tokenizer, model=model, **kwargs
    ),
)
print(f"Server running on port {cfg.server_port}")
httpd.serve_forever()
```
should this be explicitly killed at the end of training?
Can you train and run inference at the same time? This only runs if you're in --inference mode.
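If an explicit teardown were ever wanted, one option (a sketch, not what the PR does: the PR blocks on `serve_forever()` in the main thread) is to serve from a background thread and shut down cleanly:

```python
import http.server
import socketserver
import threading

# Bind an ephemeral port for the sketch; the PR uses cfg.server_addr/server_port.
httpd = socketserver.TCPServer(("127.0.0.1", 0), http.server.BaseHTTPRequestHandler)
thread = threading.Thread(target=httpd.serve_forever, daemon=True)
thread.start()
# ... training / inference would run here ...
httpd.shutdown()      # stops the serve_forever() loop
httpd.server_close()  # releases the listening socket
thread.join()
print("server stopped cleanly")
```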
Might be able to use Gradio as a web server #812
Usage example:
accelerate launch scripts/finetune.py summarize.yaml --inference --base_model=path/to/my/model --load_in_8bit=True --server --server_port 1567 --server_addr 127.0.0.1
Then in another terminal:
curl -X POST -d "$(cat test_text.txt)" http://localhost:1567/
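Beyond curl, any HTTP client works. Below is a hypothetical Python equivalent; `run_inference` is an illustrative name, and the echo handler is only a stand-in for the PR's `HttpHandler` (it uppercases the body instead of running a model) so the example is self-contained:

```python
import http.server
import threading
import urllib.request

def run_inference(instruction, url):
    # POST the raw instruction text, mirroring the curl command above.
    req = urllib.request.Request(url, data=instruction.encode("utf-8"), method="POST")
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8")

class _EchoHandler(http.server.BaseHTTPRequestHandler):
    # Stand-in for the real inference handler: uppercases the request body
    # so the example runs without a model checkpoint.
    def do_POST(self):
        body = self.rfile.read(int(self.headers["Content-Length"]))
        self.send_response(200)
        self.end_headers()
        self.wfile.write(body.upper())

    def log_message(self, *args):  # keep the demo quiet
        pass

server = http.server.HTTPServer(("127.0.0.1", 0), _EchoHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
result = run_inference("hello", url=f"http://127.0.0.1:{server.server_port}/")
print(result)  # prints "HELLO"
server.shutdown()
```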