
lazi-fy huggingface, langchain serve, litellm loading #717

Closed
wants to merge 8 commits

Conversation

leondz (Owner) commented Jun 3, 2024

no need to review/merge until #711 lands

leondz added the generators (Interfaces with LLMs) and quality-speed (This affects the speed of program use) labels Jun 3, 2024
leondz marked this pull request as ready for review June 6, 2024 15:38
leondz requested a review from jmartin-tech June 6, 2024 15:38
jmartin-tech (Collaborator) left a comment

I think this pattern needs to follow the combined multiprocessing solutions from #645 and #689

Comment on lines 587 to 592

from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration

PIL = importlib.import_module("PIL")
self.Image = PIL.Image

jmartin-tech (Collaborator):

I am a little concerned about importing libs and storing them as attributes in the generator. This will introduce a similar pickle issue during multiprocessing to the one that holding the client in OpenAIGenerator produced.

These might be better served as methods that load the libraries when not already set, combined with a __getstate__() implementation similar to:

    # avoid attempt to pickle the client attribute
    def __getstate__(self) -> object:
        self._clear_client()
        return dict(self.__dict__)
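
A minimal sketch of the full pattern being suggested here, assuming hypothetical _load_client() / _clear_client() helpers on the generator and PIL as the lazily imported dependency (the class name and the __setstate__ hook are illustrative, not the final implementation):

import importlib

class LazyImageGenerator:
    """Sketch: heavy libraries are loaded on demand and dropped before pickling."""

    def __init__(self, name: str):
        self.name = name
        self.Image = None  # populated lazily by _load_client()

    def _load_client(self):
        if self.Image is None:  # only import PIL when not already set
            PIL = importlib.import_module("PIL")
            self.Image = PIL.Image

    def _clear_client(self):
        self.Image = None  # drop the unpicklable module reference

    # avoid attempt to pickle the client attribute
    def __getstate__(self) -> object:
        self._clear_client()
        return dict(self.__dict__)

    def __setstate__(self, d):
        self.__dict__.update(d)
        self._load_client()  # re-import in the child process after unpickling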

jmartin-tech (Collaborator) commented Jun 6, 2024:

I am doing some testing of this pattern in other huggingface changes, and while I still like the idea of protecting the class for safe pickle support, there might be more work needed in general around how we share generator instances when allowing multiprocessing. As is, shifting heavy-lift objects to be created in each new process could be very expensive in terms of resources.

leondz (Owner, Author):

I can get this, it makes sense.

On the other hand - if one is doing multiprocessing with local models, the consumption of gigabytes/tens of gigabytes of GPU memory per instance seems to shrink the relative difficulties of storing libraries in the generator.

I am inclined to adopt a safer pattern here, but not to proactively support the general case of parallelisation with locally-run models; instead, implementing a generators.base.Generator attribute specifying whether a generator is parallelisation-compatible and setting it to False for 🤗 classes such as Model and Pipeline. Running parallel local models seems like an edge case - the parallelisation is intended for stuff that's lighter.
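
A minimal sketch of what that attribute could look like, assuming a hypothetical parallel_capable flag on generators.base.Generator (the flag name and default are illustrative):

class Generator:
    # base generator sketch; the run harness would check this flag before
    # handing instances to a multiprocessing pool
    parallel_capable = True

class Model(Generator):
    # local 🤗 models hold large weights and GPU state, so run them serially
    parallel_capable = False

class Pipeline(Generator):
    parallel_capable = False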

jmartin-tech (Collaborator):

"implementing a generators.base.Generator attribute specifying whether a generator is parallelisation-compatible and setting it False for 🤗 classes such as Model and Pipeline"

This is the approach I am targeting first.

I am also considering looking for methods that can defer multiprocessing approaches for locally executing generators to something specifically provided by the generator. For instance, huggingface provides the Accelerate library to allow the generator to execute inference using GPUs efficiently, and it might be reasonable to expose a _call_model_with_dataset() or something of the sort that can receive a set of attempts and have the generator figure out how to parallelize them.
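
A rough sketch of the kind of hook being described, with a hypothetical _call_model_with_dataset() method (the name and signature are illustrative, not garak's actual API):

from typing import List

class Generator:
    def _call_model(self, prompt: str) -> List[str]:
        raise NotImplementedError

    def _call_model_with_dataset(self, prompts: List[str]) -> List[List[str]]:
        # default behaviour: plain sequential fallback, one prompt at a time.
        # A huggingface-backed subclass could override this to hand the whole
        # list to its pipeline (or an Accelerate-managed model) so batching and
        # device placement happen inside the generator, rather than relying on
        # the harness's multiprocessing pool.
        return [self._call_model(p) for p in prompts]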

leondz (Owner, Author) commented Jun 10, 2024:

Oh, that could be really nice. I guess maybe worth doing after a pattern emerges for decoupling the attempt queue from running generation?

leondz (Owner, Author):

@jmartin-tech: heavy externals in these modules are now loaded/unloaded using the _load_client / _clear_client pattern

leondz requested a review from jmartin-tech June 13, 2024 09:46
Comment on lines +613 to +614
self.processor = self.LlavaNextProcessor.from_pretrained(self.name)
self.model = self.LlavaNextForConditionalGeneration.from_pretrained(
jmartin-tech (Collaborator) commented Jun 13, 2024:

This, all the way down to _load_client(), needs to move into _load_client(), as the processor and model will not transfer well in a pickle.

    def _clear_client(self):
        self.Image = None
        self.LlavaNextProcessor = None
        self.LlavaNextForConditionalGeneration = None
jmartin-tech (Collaborator):

The processor and model should not be included in the pickle.

Suggested change:
-        self.LlavaNextForConditionalGeneration = None
+        self.LlavaNextForConditionalGeneration = None
+        self.processor = None
+        self.model = None

        self.LlavaNextProcessor = transformers.LlavaNextProcessor
        self.LlavaNextForConditionalGeneration = (
            transformers.LlavaNextForConditionalGeneration
        )
jmartin-tech (Collaborator):

The processor and model need to be part of this method.

Suggested change:
-        )
+        )
+        self.processor = self.LlavaNextProcessor.from_pretrained(self.name)
+        self.model = self.LlavaNextForConditionalGeneration.from_pretrained(
+            self.name,
+            torch_dtype=self.torch_dtype,
+            low_cpu_mem_usage=self.low_cpu_mem_usage,
+        )
+        if torch.cuda.is_available():
+            self.model.to(self.device_map)
+        else:
+            raise RuntimeError(
+                "CUDA is not supported on this device. Please make sure CUDA is installed and configured properly."
+            )

@@ -142,6 +155,8 @@ def __init__(self, name: str = "", generations: int = 10, config_root=_config):
" or in the configuration file"
)

self._load_client()
jmartin-tech (Collaborator) commented Jun 13, 2024:

The provider check is still needed, and the key extraction should have been moved into a custom _validate_env_var() implementation. This looks like something I missed in #711 that incorrectly enforces an API key as required for all provider values.

Suggested change:
-        self._load_client()
+        self._load_client()
+
+    def _validate_env_var(self):
+        if self.provider is None:
+            raise ValueError(
+                "litellm generator needs to have a provider value configured - see docs"
+            )
+        if self.provider == "openai":
+            return super()._validate_env_var()

leondz (Owner, Author):

So, uh, while dealing with this slow module, I started getting a failed test, but noticed that I had an OpenAI env var key set, which meant the test actually ran. Have you ever seen the litellm tests pass? Looking at the tests we have, and the basic code examples on their website (see e.g. the invocations on https://docs.litellm.ai/docs/), the provider check seems to block intended functionality.
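
For reference, a minimal sketch of the invocation style the litellm docs show, where the provider is inferred from the model string rather than from a separately configured value (the model names and endpoint here are illustrative):

from litellm import completion

messages = [{"role": "user", "content": "Hey, how's it going?"}]

# provider inferred from the model name: OpenAI
response = completion(model="gpt-3.5-turbo", messages=messages)

# provider inferred from the prefix: a Hugging Face inference endpoint
response = completion(
    model="huggingface/WizardLM/WizardCoder-Python-34B-V1.0",
    messages=messages,
    api_base="https://my-endpoint.huggingface.cloud",  # illustrative endpoint
)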

jmartin-tech (Collaborator):

I have never had them run and pass, as they both required keys. I had noted in the original PR that it seemed like a config file would be required to instantiate the class. Although there was a comment that said it was not required, the original embedded config parsing did require a provider: if provider was not found in _config.plugins.generators["litellm.LiteLLMGenerator"], it would raise a ValueError.

I intend to validate the functionality as part of the testing here by setting up a local instance; however, there is another issue with this class, as the torch_dtype value cannot be accepted as a string. I have fixes for this in progress in the refactor branch I am working on. Short term, I was intending to manually patch the torch_dtype default value to allow testing of this change in isolation.
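
One possible interim workaround for the torch_dtype issue (not the fix in the refactor branch, just an illustrative sketch): resolve a configured string to the corresponding torch dtype before passing it to from_pretrained().

import torch

def resolve_torch_dtype(value):
    # accept either a torch.dtype or a string such as "float16" / "torch.float16"
    if isinstance(value, torch.dtype):
        return value
    dtype = getattr(torch, str(value).replace("torch.", ""), None)
    if not isinstance(dtype, torch.dtype):
        raise ValueError(f"unrecognised torch_dtype value: {value!r}")
    return dtype

# e.g. resolve_torch_dtype("float16") -> torch.float16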

jmartin-tech (Collaborator):

After some testing, I have validated that LiteLLM as implemented does require a provider from the config file. This can be enforced using something like the suggestion updated at the top of this thread, which would supply a method to suppress the API key requirement when supplying something other than openai as the provider.

A future PR can also expand the testing to provide a mock config that would allow for mocking a response from the generator, similar to the mock openai responses recently incorporated.

leondz (Owner, Author):

I think the "as implemented" is something to be flexible on (see example in #755 ). This would remove the enforcement requirement. Unfortunately I don't have a good pattern for validating the input. I'm OK with relaxing the provider constraint and letting litellm do its own validation.

leondz (Owner, Author) commented Jun 13, 2024 via email

leondz marked this pull request as draft June 19, 2024 09:19
leondz (Owner, Author) commented Jul 5, 2024:

may be resolved by #768 in which case will close

leondz closed this Jul 18, 2024
github-actions bot locked and limited conversation to collaborators Jul 18, 2024