Feat/954 llama cpp #1000

Open

bikash119 wants to merge 33 commits into develop from bikash119:feat/954_llama-cpp

Commits (33)
c9ed5fd
Support embeddings generation using llama_cpp
bikash119 Sep 24, 2024
c3464bc
Added llama-cpp-python as optional dependency
bikash119 Sep 24, 2024
582ca40
- Added normalize_embeddings argument to allow user to pass if the em…
bikash119 Sep 25, 2024
fba8ada
Update pyproject.toml
bikash119 Sep 26, 2024
e288b31
- Updated test to allow developer to define test model location.
bikash119 Sep 26, 2024
d6d4352
Merge remote-tracking branch 'upstream/develop' into feat/954_llama-cpp
bikash119 Sep 26, 2024
a936a39
- Made the test fixture session-scoped
bikash119 Sep 26, 2024
316afa0
- Reverted the changes made to model_path
bikash119 Sep 26, 2024
7137883
- Implement test_encode_batch to verify various batch sizes
bikash119 Sep 26, 2024
2d0aa76
- Included LlamaCppEmbeddings in __init__.py
bikash119 Sep 26, 2024
778532f
- Use HF_TOKEN to download model from hub to generate embeddings.
bikash119 Sep 30, 2024
55c3a0d
- Download from hub is now available through mixin
bikash119 Oct 2, 2024
935cdb8
Revert "- Download from hub is now available through mixin"
bikash119 Oct 3, 2024
29a8d56
Revert "- Use HF_TOKEN to download model from hub to generate embeddi…
bikash119 Oct 3, 2024
b40b0d2
- Removed mixin implementation to download the model
bikash119 Oct 3, 2024
b08f3ae
- Additional example added for private / public model
bikash119 Oct 4, 2024
a49363c
- The tests can now be configured to use cpu or gpu based on paramete…
bikash119 Oct 4, 2024
575f48e
- repo_id or model_path : one of the parameters is mandatory
bikash119 Oct 4, 2024
48dce7b
Added description to attribute : model
bikash119 Oct 4, 2024
0e1fb8e
- Fixed examples
bikash119 Oct 4, 2024
f72ef30
Updated examples
bikash119 Oct 4, 2024
8218242
Update src/distilabel/embeddings/llamacpp.py
bikash119 Oct 14, 2024
db00482
Update src/distilabel/embeddings/llamacpp.py
bikash119 Oct 14, 2024
0fb7f15
Update src/distilabel/embeddings/llamacpp.py
bikash119 Oct 14, 2024
155feb2
Updated test to set disable_cuda_device_placement=True when testing f…
bikash119 Oct 14, 2024
b218b44
Merge branch 'develop' into feat/954_llama-cpp
bikash119 Oct 14, 2024
58aa996
Merge branch 'develop' into feat/954_llama-cpp
bikash119 Oct 16, 2024
3659400
Test case will load the model to CPU by default
bikash119 Oct 16, 2024
92481b0
Merge branch 'feat/954_llama-cpp' of github.com:bikash119/distilabel …
bikash119 Oct 16, 2024
ef98d63
Merge branch 'develop' into feat/954_llama-cpp
bikash119 Oct 19, 2024
2258190
Updated import statements to align with the new folder structure
bikash119 Oct 26, 2024
da92cc9
example code updated
bikash119 Oct 26, 2024
09dd551
examples fixed
bikash119 Oct 26, 2024
4 changes: 3 additions & 1 deletion .gitignore
@@ -77,4 +77,6 @@ venv.bak/
# Other
*.log
*.swp
.DS_Store
.DS_Store
#models
tests/model
2 changes: 2 additions & 0 deletions src/distilabel/embeddings/__init__.py
@@ -13,11 +13,13 @@
# limitations under the License.

from distilabel.embeddings.base import Embeddings
from distilabel.embeddings.llamacpp import LlamaCppEmbeddings
from distilabel.embeddings.sentence_transformers import SentenceTransformerEmbeddings
from distilabel.embeddings.vllm import vLLMEmbeddings

__all__ = [
"Embeddings",
"SentenceTransformerEmbeddings",
"vLLMEmbeddings",
"LlamaCppEmbeddings",
]
251 changes: 251 additions & 0 deletions src/distilabel/embeddings/llamacpp.py
@@ -0,0 +1,251 @@
# Copyright 2023-present, Argilla, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from pathlib import Path
from typing import TYPE_CHECKING, Any, Dict, List, Optional, Union

from pydantic import Field, PrivateAttr

from distilabel.embeddings.base import Embeddings
from distilabel.llms.mixins.cuda_device_placement import CudaDevicePlacementMixin
from distilabel.mixins.runtime_parameters import RuntimeParameter

if TYPE_CHECKING:
from llama_cpp import Llama


class LlamaCppEmbeddings(Embeddings, CudaDevicePlacementMixin):
Review comment (marked as resolved): aren't some of the attributes already present in the parent class?

"""`LlamaCpp` library implementation for embedding generation.

Attributes:
model_name: contains the name of the GGUF quantized model, compatible with the
installed version of the `llama.cpp` Python bindings.
model_path: contains the path to the GGUF quantized model, compatible with the
installed version of the `llama.cpp` Python bindings.
repo_id: the Hugging Face Hub repository id.
verbose: whether to print verbose output. Defaults to `False`.
n_gpu_layers: number of layers to run on the GPU. Defaults to `-1` (use the GPU if available).
disable_cuda_device_placement: whether to disable CUDA device placement. Defaults to `True`.
normalize_embeddings: whether to normalize the embeddings. Defaults to `False`.
seed: RNG seed, `-1` for random.
n_ctx: text context, `0` = from model.
n_batch: prompt processing maximum batch size.
extra_kwargs: additional dictionary of keyword arguments that will be passed to the
`Llama` class of `llama_cpp` library. Defaults to `{}`.
_model: the `Llama` model instance. This attribute is meant to be used internally
and should not be accessed directly. It will be set in the `load` method.

Runtime parameters:
- `n_gpu_layers`: the number of layers to use for the GPU. Defaults to `-1`.
- `verbose`: whether to print verbose output. Defaults to `False`.
- `normalize_embeddings`: whether to normalize the embeddings. Defaults to `False`.
- `extra_kwargs`: additional dictionary of keyword arguments that will be passed to the
`Llama` class of `llama_cpp` library. Defaults to `{}`.

References:
- [Offline inference embeddings](https://llama-cpp-python.readthedocs.io/en/stable/#embeddings)

Examples:
Generate sentence embeddings using a local model:

```python
from pathlib import Path
from distilabel.embeddings import LlamaCppEmbeddings

# You can follow along with this example by downloading the model with the command below,
# which saves it to the `Downloads` folder:
# curl -L -o ~/Downloads/all-MiniLM-L6-v2-Q2_K.gguf https://huggingface.co/second-state/All-MiniLM-L6-v2-Embedding-GGUF/resolve/main/all-MiniLM-L6-v2-Q2_K.gguf

model_path = "Downloads/"
model = "all-MiniLM-L6-v2-Q2_K.gguf"
embeddings = LlamaCppEmbeddings(model=model, model_path=str(Path.home() / model_path))

embeddings.load()

results = embeddings.encode(inputs=["distilabel is awesome!", "and Argilla!"])
# [
# [-0.05447685346007347, -0.01623094454407692, ...],
# [4.4889533455716446e-05, 0.044016145169734955, ...],
# ]
```

Generate sentence embeddings using a HuggingFace Hub public model:

```python
from distilabel.embeddings import LlamaCppEmbeddings

repo_id = "second-state/All-MiniLM-L6-v2-Embedding-GGUF"
model = "all-MiniLM-L6-v2-Q5_K_M.gguf"
embeddings = LlamaCppEmbeddings(model=model, repo_id=repo_id)

embeddings.load()

results = embeddings.encode(inputs=["distilabel is awesome!", "and Argilla!"])
# [
# [-0.05447685346007347, -0.01623094454407692, ...],
# [4.4889533455716446e-05, 0.044016145169734955, ...],
# ]
```

Generate sentence embeddings using a HuggingFace Hub private model:

```python
import os

from distilabel.embeddings import LlamaCppEmbeddings

# You need to set the `HF_TOKEN` environment variable to download a private model from the Hub.
os.environ["HF_TOKEN"] = "hf_..."

repo_id = "private_repo_id"
model = "model"
embeddings = LlamaCppEmbeddings(model=model, repo_id=repo_id)

embeddings.load()

results = embeddings.encode(inputs=["distilabel is awesome!", "and Argilla!"])
# [
# [-0.05447685346007347, -0.01623094454407692, ...],
# [4.4889533455716446e-05, 0.044016145169734955, ...],
# ]
```

Generate sentence embeddings on CPU:

```python
from pathlib import Path
from distilabel.embeddings import LlamaCppEmbeddings

# You can follow along with this example by downloading the model with the command below,
# which saves it to the `Downloads` folder:
# curl -L -o ~/Downloads/all-MiniLM-L6-v2-Q2_K.gguf https://huggingface.co/second-state/All-MiniLM-L6-v2-Embedding-GGUF/resolve/main/all-MiniLM-L6-v2-Q2_K.gguf

model_path = "Downloads/"
model = "all-MiniLM-L6-v2-Q2_K.gguf"
embeddings = LlamaCppEmbeddings(model=model, model_path=str(Path.home() / model_path), n_gpu_layers=0)

embeddings.load()

results = embeddings.encode(inputs=["distilabel is awesome!", "and Argilla!"])
# [
# [-0.05447685346007347, -0.01623094454407692, ...],
# [4.4889533455716446e-05, 0.044016145169734955, ...],
# ]
```


"""

model: str = Field(
description="The name of the model to use for embeddings.",
)

model_path: RuntimeParameter[str] = Field(
default=None,
description="The path to the GGUF quantized model, compatible with the installed version of the `llama.cpp` Python bindings.",
)

repo_id: RuntimeParameter[str] = Field(
default=None, description="The Hugging Face Hub repository id.", exclude=True
)

n_gpu_layers: RuntimeParameter[int] = Field(
default=-1,
description="The number of layers that will be loaded in the GPU.",
)

n_ctx: int = 512
n_batch: int = 512
seed: int = 4294967295

normalize_embeddings: RuntimeParameter[bool] = Field(
default=False,
description="Whether to normalize the embeddings.",
)
verbose: RuntimeParameter[bool] = Field(
default=False,
description="Whether to print verbose output from llama.cpp library.",
)
extra_kwargs: Optional[RuntimeParameter[Dict[str, Any]]] = Field(
default_factory=dict,
description="Additional dictionary of keyword arguments that will be passed to the"
" `Llama` class of `llama_cpp` library. See all the supported arguments at: "
"https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama.__init__",
)
_model: Optional["Llama"] = PrivateAttr(...)

def load(self) -> None:
"""Loads the `gguf` model using either the path or the Hugging Face Hub repository id."""
super().load()
self.disable_cuda_device_placement = True
CudaDevicePlacementMixin.load(self)

try:
from llama_cpp import Llama
except ImportError as ie:
raise ImportError(
"`llama-cpp-python` package is not installed. Please install it using"
" `pip install llama-cpp-python`."
) from ie

if self.repo_id is not None:
# use repo_id to download the model
from huggingface_hub.utils import validate_repo_id

validate_repo_id(self.repo_id)
self._model = Llama.from_pretrained(
repo_id=self.repo_id,
filename=self.model,
n_gpu_layers=self.n_gpu_layers,
seed=self.seed,
n_ctx=self.n_ctx,
n_batch=self.n_batch,
verbose=self.verbose,
embedding=True,
**self.extra_kwargs,
)
elif self.model_path is not None:
self._model = Llama(
model_path=str(Path(self.model_path) / self.model),
n_gpu_layers=self.n_gpu_layers,
seed=self.seed,
n_ctx=self.n_ctx,
n_batch=self.n_batch,
verbose=self.verbose,
embedding=True,
**self.extra_kwargs,
)
else:
raise ValueError("Either 'model_path' or 'repo_id' must be provided")

def unload(self) -> None:
"""Unloads the `gguf` model."""
CudaDevicePlacementMixin.unload(self)
super().unload()

@property
def model_name(self) -> str:
"""Returns the name of the model."""
return self.model

def encode(self, inputs: List[str]) -> List[List[Union[int, float]]]:
"""Generates embeddings for the provided inputs.

Args:
inputs: a list of texts for which an embedding has to be generated.

Returns:
The generated embeddings.
"""
return self._model.embed(inputs, normalize=self.normalize_embeddings)
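
For illustration only (not part of this diff), a minimal sketch of how the `extra_kwargs` and `normalize_embeddings` runtime parameters could be combined with a Hub-hosted model; `n_threads` is just one example of a `llama_cpp.Llama.__init__` option that can be forwarded:

```python
from distilabel.embeddings import LlamaCppEmbeddings

embeddings = LlamaCppEmbeddings(
    model="all-MiniLM-L6-v2-Q5_K_M.gguf",
    repo_id="second-state/All-MiniLM-L6-v2-Embedding-GGUF",
    normalize_embeddings=True,
    # Any keyword argument accepted by `llama_cpp.Llama.__init__` can be forwarded here.
    extra_kwargs={"n_threads": 4},
)
embeddings.load()

results = embeddings.encode(inputs=["distilabel is awesome!", "and Argilla!"])
embeddings.unload()
```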
78 changes: 78 additions & 0 deletions tests/unit/conftest.py
@@ -12,7 +12,10 @@
# See the License for the specific language governing permissions and
# limitations under the License.

import atexit
import os
from typing import TYPE_CHECKING, Any, Dict, List, Union
from urllib.request import urlretrieve

import pytest

@@ -102,3 +105,78 @@ class DummyTaskOfflineBatchGeneration(DummyTask):
@pytest.fixture
def dummy_llm() -> AsyncLLM:
return DummyAsyncLLM()


@pytest.fixture(scope="session")
def local_llamacpp_model_path(tmp_path_factory):
"""
Session-scoped fixture that provides the local model path for LlamaCpp testing.

The model path can be set using the LLAMACPP_TEST_MODEL_PATH environment variable.
If not set, it downloads a small test model to a temporary directory.
The model is downloaded once per test session and cleaned up after all tests.

To use a custom model:
1. Set the LLAMACPP_TEST_MODEL_PATH environment variable to the path of your model file.
2. Ensure the model file exists at the specified path.

Example:
export LLAMACPP_TEST_MODEL_PATH="/path/to/your/model.gguf"

Args:
tmp_path_factory: Pytest fixture providing a temporary directory factory.

Returns:
str: The path to the local LlamaCpp model file.
"""
print("\nLlamaCpp model path information:")

# Check for environment variable first
env_path = os.environ.get("LLAMACPP_TEST_MODEL_PATH")
if env_path:
print(f"Using custom model path from LLAMACPP_TEST_MODEL_PATH: {env_path}")
if not os.path.exists(env_path):
raise FileNotFoundError(
f"Custom model file not found at {env_path}. Please ensure the file exists."
)
return env_path

print("LLAMACPP_TEST_MODEL_PATH not set. Using default test model.")
print(
"To use a custom model, set the LLAMACPP_TEST_MODEL_PATH environment variable to the path of your model file."
)

# If env var not set, use a small test model
model_name = "all-MiniLM-L6-v2-Q2_K.gguf"
model_url = f"https://huggingface.co/second-state/All-MiniLM-L6-v2-Embedding-GGUF/resolve/main/{model_name}"
tmp_path = tmp_path_factory.getbasetemp()
model_path = tmp_path / model_name

if not model_path.exists():
urlretrieve(model_url, model_path)

def cleanup():
if model_path.exists():
os.remove(model_path)

# Register the cleanup function to be called at exit
atexit.register(cleanup)

return str(tmp_path)


def pytest_addoption(parser):
"""
Add a command-line option to pytest for CPU-only testing.
"""
parser.addoption(
"--cpu-only", action="store_true", default=False, help="Run tests on CPU only"
)


@pytest.fixture
def use_cpu(request):
"""
Fixture to determine whether to use CPU based on command-line option.
"""
return request.config.getoption("--cpu-only")
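
As a hedged illustration (not part of this diff) of how the new `local_llamacpp_model_path` and `use_cpu` fixtures might be consumed by a unit test, e.g. when invoking `pytest tests/unit --cpu-only`; the test name and assertion below are hypothetical:

```python
from distilabel.embeddings import LlamaCppEmbeddings


def test_encode_with_local_model(local_llamacpp_model_path, use_cpu):
    embeddings = LlamaCppEmbeddings(
        model="all-MiniLM-L6-v2-Q2_K.gguf",  # filename downloaded by the fixture
        model_path=local_llamacpp_model_path,  # directory returned by the fixture
        n_gpu_layers=0 if use_cpu else -1,
    )
    embeddings.load()

    results = embeddings.encode(inputs=["distilabel is awesome!", "and Argilla!"])
    assert len(results) == 2

    embeddings.unload()
```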