[FEAT] Add integration with huggingface_hub.utils.telemetry #5218

Merged
Changes from 58 commits
70 commits
431f296
Update `TelemetryClient` to use `huggingface_hub.utils`
davidberenstein1957 Jul 12, 2024
2d4aef4
Update `user` `CRUD` telemetry tracking
davidberenstein1957 Jul 12, 2024
71cf614
Update `workspace` CRUD telemetry tracking
davidberenstein1957 Jul 12, 2024
e1924e4
Update `workspace` telemetry from `list_user_workspaces` method
davidberenstein1957 Jul 12, 2024
11a1f85
Fix arguments passed to `track_crud_workspace` in `list_user_workspaces`
davidberenstein1957 Jul 12, 2024
9adfc16
Fix `await` to `telemetry` call
davidberenstein1957 Jul 12, 2024
0109e44
Update `telemetry_client: TelemetryClient = Depends(get_telemetry_cli…
davidberenstein1957 Jul 12, 2024
038d8f6
Add telemetry methods, dataset, workspace, user, settings, records, r…
davidberenstein1957 Jul 15, 2024
dbcebdc
Add telemetry methods `fields`
davidberenstein1957 Jul 15, 2024
5581837
Add telemetry methods `metadata_properties`
davidberenstein1957 Jul 15, 2024
9d7316d
Add telemetry methods `questions`
davidberenstein1957 Jul 15, 2024
01d8af7
Add telemetry methods `records`
davidberenstein1957 Jul 15, 2024
cea525c
Add telemetry methods to `responses`
davidberenstein1957 Jul 15, 2024
aa9c6ca
Add telemetry methods to `users`
davidberenstein1957 Jul 15, 2024
b694ce1
Add telemetry `suggestions`
davidberenstein1957 Jul 15, 2024
ebf139e
Update `track_crud_dataset_setting` processing
davidberenstein1957 Jul 16, 2024
fc2055c
Merge branch 'develop' into feat/5204-feature-add-huggingface_hubutil…
davidberenstein1957 Jul 16, 2024
77dd130
Add `enable_telemetry` check
davidberenstein1957 Jul 16, 2024
6979112
Remove `disable_send`
davidberenstein1957 Jul 16, 2024
9bffe18
Deprecate `ARGILLA_ENABLE_TELEMETRY` env var
davidberenstein1957 Jul 16, 2024
e6763cc
Update `test_telemetry`
davidberenstein1957 Jul 16, 2024
a94ab7c
Add enable telemetry to post_init
davidberenstein1957 Jul 16, 2024
a6f7c0f
Add `UUID` to `str` conversion
davidberenstein1957 Jul 16, 2024
a70c590
Run tests with enabled telemetry
davidberenstein1957 Jul 16, 2024
d762dc3
Remove `telemetry` client
davidberenstein1957 Jul 16, 2024
4addc7b
Fix tests errors
davidberenstein1957 Jul 16, 2024
ac7601c
Update `test_telemetry` fixture
davidberenstein1957 Jul 16, 2024
c72c4b0
Update disable telemetry env var
davidberenstein1957 Jul 16, 2024
538c268
Fix tests dataset creation
davidberenstein1957 Jul 17, 2024
0b167eb
Fix failing tests due to unloaded DatabaseModels
davidberenstein1957 Jul 17, 2024
4fcbbd6
Add tests telemetry crud datasets
davidberenstein1957 Jul 17, 2024
8428939
Update tests coverage telemetry tracking
davidberenstein1957 Jul 17, 2024
daea0b2
Merge branch 'develop' into feat/5204-feature-add-huggingface_hubutil…
davidberenstein1957 Jul 17, 2024
18c5d0f
Update argilla-server/src/argilla_server/settings.py
davidberenstein1957 Jul 17, 2024
0230bfa
Remove Python version from system info
davidberenstein1957 Jul 18, 2024
cbacf35
Update tests to also check assert call track_data
davidberenstein1957 Jul 18, 2024
2473c82
Add documentation for telemetry information (#5253)
davidberenstein1957 Jul 22, 2024
35c9e43
Add async telemetry client
davidberenstein1957 Jul 22, 2024
df9a0fc
Update `test_telemetry`
davidberenstein1957 Jul 22, 2024
c64cc29
Update list-like to basic CRUD
davidberenstein1957 Jul 23, 2024
8b0c752
Merge branch 'develop' into feat/5204-feature-add-huggingface_hubutil…
davidberenstein1957 Aug 19, 2024
0ed746b
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 19, 2024
fef47b8
Resolved utcnow() deprecation
davidberenstein1957 Aug 19, 2024
b982497
Merge branch 'feat/5204-feature-add-huggingface_hubutilssend_telemetr…
davidberenstein1957 Aug 19, 2024
e332696
Update to always add error type to overview
davidberenstein1957 Aug 19, 2024
a1c83a1
Add dataset distribution tracking to telemetry
davidberenstein1957 Aug 19, 2024
17b0403
Revert UTC change
davidberenstein1957 Aug 19, 2024
f6e9f38
fix failing tests
davidberenstein1957 Aug 19, 2024
68d6e53
fix failing tests
davidberenstein1957 Aug 19, 2024
de181a9
Remove server id from telemetry to be more GDPR compliant
davidberenstein1957 Aug 19, 2024
b823e2e
Update telemetry workflow
davidberenstein1957 Aug 20, 2024
bb2dfd5
Merge branch 'develop' into feat/5204-feature-add-huggingface_hubutil…
frascuchon Aug 26, 2024
cad43a1
chore: add huggingface_hub to dependencies
davidberenstein1957 Aug 26, 2024
5fd9492
Merge branch 'feat/5204-feature-add-huggingface_hubutilssend_telemetr…
davidberenstein1957 Aug 26, 2024
659d9a2
docs: update telemetry sections
davidberenstein1957 Aug 27, 2024
c725352
update: usage from record_subtopic to record_suggestions and record_r…
davidberenstein1957 Aug 27, 2024
93d46b1
refactor: introduced track_error specific method
davidberenstein1957 Aug 27, 2024
f5901b9
refactor: name search operation like "search"
davidberenstein1957 Aug 27, 2024
f0019cc
[FEAT] argilla server: add basic endpoints telemetry support (#5437)
frascuchon Sep 2, 2024
43ef896
Merge branch 'develop' into feat/5204-feature-add-huggingface_hubutil…
frascuchon Sep 2, 2024
8ce3621
chore: Remove all non-general endpoint telemetry related-code
frascuchon Sep 2, 2024
c7c22f8
Update argilla/mkdocs.yml
frascuchon Sep 2, 2024
32f3baa
chore: Revert doc change
frascuchon Sep 2, 2024
0c9c608
chore: revert doc changes
frascuchon Sep 2, 2024
00b6caa
chore: Remove unused attribute
frascuchon Sep 2, 2024
942455d
[FEAT] `argilla server`: add user and server id on telemetry metrics …
frascuchon Sep 2, 2024
0b825a0
[FEAT] `argilla server`: track server startup (#5447)
frascuchon Sep 3, 2024
d93c27b
chore: Align the user.id registration
frascuchon Sep 3, 2024
cd45e6b
chore: review docs
frascuchon Sep 3, 2024
0c51124
chore: Update CHANGELOG
frascuchon Sep 3, 2024
2 changes: 1 addition & 1 deletion .github/workflows/argilla-frontend.deploy-environment.yml
@@ -63,7 +63,7 @@ jobs:
ADMIN_API_KEY=${{ steps.credentials.outputs.admin }}
ANNOTATOR_PASSWORD=${{ steps.credentials.outputs.annotator }}
ANNOTATOR_API_KEY=${{ steps.credentials.outputs.annotator }}
ARGILLA_ENABLE_TELEMETRY=0
HF_HUB_DISABLE_TELEMETRY=1
frascuchon marked this conversation as resolved.
davidberenstein1957 marked this conversation as resolved.
API_BASE_URL=https://dev.argilla.io/

- name: Post credentials in Slack
2 changes: 1 addition & 1 deletion .github/workflows/argilla-server.yml
@@ -52,7 +52,7 @@ jobs:
- 5432:5432

env:
ARGILLA_ENABLE_TELEMETRY: 0
HF_HUB_DISABLE_TELEMETRY: 1

steps:
- name: Checkout Code 🛎
13 changes: 8 additions & 5 deletions argilla-server/pdm.lock

Some generated files are not rendered by default.

4 changes: 3 additions & 1 deletion argilla-server/pyproject.toml
@@ -57,6 +57,8 @@ dependencies = [
"typer >= 0.6.0, < 0.10.0", # spaCy only supports typer<0.10.0
"packaging>=23.2",
"psycopg2-binary>=2.9.9",
# For Telemetry
"huggingface_hub>=0.13,<1",
davidberenstein1957 marked this conversation as resolved.
]

[project.optional-dependencies]
@@ -106,7 +108,7 @@ log_format = "%(asctime)s %(name)s %(levelname)s %(message)s"
log_date_format = "%Y-%m-%d %H:%M:%S"
log_cli = "True"
testpaths = ["tests"]
env = ["ARGILLA_ENABLE_TELEMETRY=0"]
env = ["HF_HUB_DISABLE_TELEMETRY=1"]

[tool.coverage.run]
concurrency = ["greenlet", "thread", "multiprocessing"]
6 changes: 5 additions & 1 deletion argilla-server/src/argilla_server/_app.py
@@ -174,10 +174,14 @@ def show_telemetry_warning():
" https://docs.argilla.io/latest/reference/argilla-server/telemetry/\n\n"
"Telemetry is currently enabled. If you want to disable it, you can configure\n"
"the environment variable before relaunching the server:\n\n"
f'{"#set ARGILLA_ENABLE_TELEMETRY=0" if os.name == "nt" else "$>export ARGILLA_ENABLE_TELEMETRY=0"}'
f'{"#set HF_HUB_DISABLE_TELEMETRY=1" if os.name == "nt" else "$>export HF_HUB_DISABLE_TELEMETRY=1"}'
)
_LOGGER.warning(message)

message += "\n\n "
message += "#set HF_HUB_DISABLE_TELEMETRY=1" if os.name == "nt" else "$>export HF_HUB_DISABLE_TELEMETRY=1"
message += "\n"
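
The warning text moves from the opt-in style `ARGILLA_ENABLE_TELEMETRY=0` to the opt-out style `HF_HUB_DISABLE_TELEMETRY=1`, so the meaning of the flag value is inverted. Below is a minimal sketch of how the two flags could be reconciled. It is not code from this PR: `telemetry_enabled` is a hypothetical helper, the set of truthy strings is an assumption, and honoring the deprecated variable as a fallback is also an assumption.

```python
import os
from typing import Mapping, Optional

# Assumption: strings treated as "true". The authoritative parsing lives in huggingface_hub.
_TRUE_VALUES = {"1", "ON", "YES", "TRUE"}
_FALSE_VALUES = {"0", "OFF", "NO", "FALSE"}


def telemetry_enabled(environ: Optional[Mapping[str, str]] = None) -> bool:
    """Hypothetical helper: telemetry stays on unless one of the flags turns it off."""
    env = os.environ if environ is None else environ

    # New-style opt-out flag: HF_HUB_DISABLE_TELEMETRY=1 disables telemetry.
    if env.get("HF_HUB_DISABLE_TELEMETRY", "").strip().upper() in _TRUE_VALUES:
        return False

    # Legacy opt-in flag (deprecated): ARGILLA_ENABLE_TELEMETRY=0 disables telemetry.
    legacy = env.get("ARGILLA_ENABLE_TELEMETRY")
    if legacy is not None and legacy.strip().upper() in _FALSE_VALUES:
        return False

    return True
```

Passing a plain dict instead of reading `os.environ` keeps the helper easy to exercise in tests without touching the process environment.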


async def _create_oauth_allowed_workspaces(db: AsyncSession):
from argilla_server.security.settings import settings as security_settings
108 changes: 93 additions & 15 deletions argilla-server/src/argilla_server/api/handlers/v1/datasets/datasets.py
@@ -22,14 +22,14 @@
from argilla_server.api.policies.v1 import DatasetPolicy, MetadataPropertyPolicy, authorize, is_authorized
from argilla_server.api.schemas.v1.datasets import (
Dataset as DatasetSchema,
UsersProgress,
)
from argilla_server.api.schemas.v1.datasets import (
DatasetCreate,
DatasetMetrics,
DatasetProgress,
Datasets,
DatasetUpdate,
UsersProgress,
)
from argilla_server.api.schemas.v1.fields import Field, FieldCreate, Fields
from argilla_server.api.schemas.v1.metadata_properties import (
@@ -71,6 +71,7 @@ async def _filter_metadata_properties_by_policy(
async def list_current_user_datasets(
*,
db: AsyncSession = Depends(get_async_db),
telemetry_client: TelemetryClient = Depends(get_telemetry_client),
workspace_id: Optional[UUID] = None,
current_user: User = Security(auth.get_current_user),
):
@@ -85,34 +86,64 @@ async def list_current_user_datasets(
else:
dataset_list = await datasets.list_datasets_by_workspace_id(db, workspace_id)

await telemetry_client.track_crud_dataset(action="me/list", dataset=None, count=len(dataset_list))

return Datasets(items=dataset_list)


@router.get("/datasets/{dataset_id}/fields", response_model=Fields)
async def list_dataset_fields(
*, db: AsyncSession = Depends(get_async_db), dataset_id: UUID, current_user: User = Security(auth.get_current_user)
*,
db: AsyncSession = Depends(get_async_db),
telemetry_client: TelemetryClient = Depends(get_telemetry_client),
dataset_id: UUID,
current_user: User = Security(auth.get_current_user),
):
dataset = await Dataset.get_or_raise(db, dataset_id, options=[selectinload(Dataset.fields)])

await authorize(current_user, DatasetPolicy.get(dataset))

for field in dataset.fields:
await telemetry_client.track_crud_dataset_setting(
action="read", dataset=dataset, setting_name="fields", setting=field
)
Member

Do we know how these telemetry requests are done? Are they synchronous? UDP?

We should check that we are not spending a lot of time executing these requests so we are not adding additional time to the API endpoint requests.
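
One common way to keep telemetry off the request path is to enqueue events and send them from a background thread, so the handler only pays for a queue insert. The sketch below is not this PR's implementation: `BackgroundSender`, `track`, and `flush` are hypothetical names, and the injected `send` callable stands in for whatever actually ships the event. As far as I recall, `huggingface_hub.utils.send_telemetry` is documented as sending from a separate thread already, which would be worth confirming against the installed version.

```python
import queue
import threading
from typing import Callable


class BackgroundSender:
    """Hypothetical fire-and-forget sender: callers enqueue and return immediately."""

    def __init__(self, send: Callable[[str, dict], None]) -> None:
        self._send = send
        self._queue: "queue.Queue[tuple]" = queue.Queue()
        # Daemon thread so a pending event never blocks interpreter shutdown.
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def track(self, topic: str, data: dict) -> None:
        # Only enqueues; the potentially slow send happens on the worker thread.
        self._queue.put((topic, data))

    def _run(self) -> None:
        while True:
            topic, data = self._queue.get()
            try:
                self._send(topic, data)
            except Exception:
                # Telemetry failures should never reach the request path.
                pass
            finally:
                self._queue.task_done()

    def flush(self) -> None:
        # Block until everything queued so far has been processed (handy in tests).
        self._queue.join()
```

With this shape, the cost seen by an endpoint is independent of how long the network call takes, and a failing telemetry backend cannot raise inside a handler.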

await telemetry_client.track_crud_dataset_setting(
action="list", dataset=dataset, setting_name="fields", count=len(dataset.fields)
)

return Fields(items=dataset.fields)


@router.get("/datasets/{dataset_id}/vectors-settings", response_model=VectorsSettings)
async def list_dataset_vector_settings(
*, db: AsyncSession = Depends(get_async_db), dataset_id: UUID, current_user: User = Security(auth.get_current_user)
*,
db: AsyncSession = Depends(get_async_db),
telemetry_client: TelemetryClient = Depends(get_telemetry_client),
dataset_id: UUID,
current_user: User = Security(auth.get_current_user),
):
dataset = await Dataset.get_or_raise(db, dataset_id, options=[selectinload(Dataset.vectors_settings)])

await authorize(current_user, DatasetPolicy.get(dataset))

for vectors_setting in dataset.vectors_settings:
await telemetry_client.track_crud_dataset_setting(
action="read", dataset=dataset, setting_name="vectors_settings", setting=vectors_setting
)
await telemetry_client.track_crud_dataset_setting(
action="list", dataset=dataset, setting_name="vectors_settings", count=len(dataset.vectors_settings)
)

return VectorsSettings(items=dataset.vectors_settings)


@router.get("/me/datasets/{dataset_id}/metadata-properties", response_model=MetadataProperties)
async def list_current_user_dataset_metadata_properties(
*, db: AsyncSession = Depends(get_async_db), dataset_id: UUID, current_user: User = Security(auth.get_current_user)
*,
db: AsyncSession = Depends(get_async_db),
telemetry_client: TelemetryClient = Depends(get_telemetry_client),
dataset_id: UUID,
current_user: User = Security(auth.get_current_user),
):
dataset = await Dataset.get_or_raise(db, dataset_id, options=[selectinload(Dataset.metadata_properties)])

@@ -122,17 +153,31 @@ async def list_current_user_dataset_metadata_properties(
current_user, dataset.metadata_properties
)

for metadata_property in filtered_metadata_properties:
await telemetry_client.track_crud_dataset_setting(
action="read", dataset=dataset, setting_name="me/metadata_properties", setting=metadata_property
)
await telemetry_client.track_crud_dataset_setting(
action="list", dataset=dataset, setting_name="me/metadata_properties", count=len(filtered_metadata_properties)
)

return MetadataProperties(items=filtered_metadata_properties)


@router.get("/datasets/{dataset_id}", response_model=DatasetSchema)
async def get_dataset(
*, db: AsyncSession = Depends(get_async_db), dataset_id: UUID, current_user: User = Security(auth.get_current_user)
*,
db: AsyncSession = Depends(get_async_db),
telemetry_client: TelemetryClient = Depends(get_telemetry_client),
dataset_id: UUID,
current_user: User = Security(auth.get_current_user),
):
dataset = await Dataset.get_or_raise(db, dataset_id)

await authorize(current_user, DatasetPolicy.get(dataset))

await telemetry_client.track_crud_dataset(action="read", dataset=dataset)

return dataset


@@ -184,18 +229,24 @@ async def get_dataset_users_progress(
async def create_dataset(
*,
db: AsyncSession = Depends(get_async_db),
telemetry_client: TelemetryClient = Depends(get_telemetry_client),
dataset_create: DatasetCreate,
current_user: User = Security(auth.get_current_user),
):
await authorize(current_user, DatasetPolicy.create(dataset_create.workspace_id))

return await datasets.create_dataset(db, dataset_create.dict())
dataset = await datasets.create_dataset(db, dataset_create.dict())

await telemetry_client.track_crud_dataset(action="create", dataset=dataset)

return dataset


@router.post("/datasets/{dataset_id}/fields", status_code=status.HTTP_201_CREATED, response_model=Field)
async def create_dataset_field(
*,
db: AsyncSession = Depends(get_async_db),
telemetry_client: TelemetryClient = Depends(get_telemetry_client),
dataset_id: UUID,
field_create: FieldCreate,
current_user: User = Security(auth.get_current_user),
@@ -204,7 +255,13 @@ async def create_dataset_field(

await authorize(current_user, DatasetPolicy.create_field(dataset))

return await datasets.create_field(db, dataset, field_create)
field = await datasets.create_field(db, dataset, field_create)

await telemetry_client.track_crud_dataset_setting(
action="create", setting_name="fields", dataset=dataset, setting=field
)

return field


@router.post(
@@ -214,6 +271,7 @@ async def create_dataset_metadata_property(
*,
db: AsyncSession = Depends(get_async_db),
search_engine: SearchEngine = Depends(get_search_engine),
telemetry_client: TelemetryClient = Depends(get_telemetry_client),
dataset_id: UUID,
metadata_property_create: MetadataPropertyCreate,
current_user: User = Security(auth.get_current_user),
@@ -222,7 +280,13 @@ async def create_dataset_metadata_property(

await authorize(current_user, DatasetPolicy.create_metadata_property(dataset))

return await datasets.create_metadata_property(db, search_engine, dataset, metadata_property_create)
metadata_property = await datasets.create_metadata_property(db, search_engine, dataset, metadata_property_create)

await telemetry_client.track_crud_dataset_setting(
action="create", setting_name="metadata_properties", dataset=dataset, setting=metadata_property
)

return metadata_property


@router.post(
@@ -232,6 +296,7 @@ async def create_dataset_vector_settings(
*,
db: AsyncSession = Depends(get_async_db),
search_engine: SearchEngine = Depends(get_search_engine),
telemetry_client: TelemetryClient = Depends(get_telemetry_client),
dataset_id: UUID,
vector_settings_create: VectorSettingsCreate,
current_user: User = Security(auth.get_current_user),
@@ -240,7 +305,13 @@ async def create_dataset_vector_settings(

await authorize(current_user, DatasetPolicy.create_vector_settings(dataset))

return await datasets.create_vector_settings(db, search_engine, dataset, vector_settings_create)
vector_setting = await datasets.create_vector_settings(db, search_engine, dataset, vector_settings_create)

await telemetry_client.track_crud_dataset_setting(
action="create", setting_name="vectors_settings", dataset=dataset, setting=vector_setting
)

return vector_setting


@router.put("/datasets/{dataset_id}/publish", response_model=DatasetSchema)
@@ -267,10 +338,7 @@ async def publish_dataset(

dataset = await datasets.publish_dataset(db, search_engine, dataset)

telemetry_client.track_data(
action="PublishedDataset",
data={"questions": list(set([question.settings["type"] for question in dataset.questions]))},
)
await telemetry_client.track_crud_dataset(action="create", dataset=dataset)
Member

Why not "publish" instead of "create"?

Member Author

Fine by me. I wanted to align things as much as possible within the same topic URLs, but I will differentiate a bit more and potentially also create a separate one for "list".


return dataset

@@ -280,20 +348,26 @@ async def delete_dataset(
*,
db: AsyncSession = Depends(get_async_db),
search_engine: SearchEngine = Depends(get_search_engine),
telemetry_client: TelemetryClient = Depends(get_telemetry_client),
dataset_id: UUID,
current_user: User = Security(auth.get_current_user),
):
dataset = await Dataset.get_or_raise(db, dataset_id)

await authorize(current_user, DatasetPolicy.delete(dataset))

return await datasets.delete_dataset(db, search_engine, dataset)
dataset = await datasets.delete_dataset(db, search_engine, dataset)

await telemetry_client.track_crud_dataset(action="delete", dataset=dataset)

return dataset


@router.patch("/datasets/{dataset_id}", response_model=DatasetSchema)
async def update_dataset(
*,
db: AsyncSession = Depends(get_async_db),
telemetry_client: TelemetryClient = Depends(get_telemetry_client),
dataset_id: UUID,
dataset_update: DatasetUpdate,
current_user: User = Security(auth.get_current_user),
@@ -302,4 +376,8 @@ async def update_dataset(

await authorize(current_user, DatasetPolicy.update(dataset))

return await datasets.update_dataset(db, dataset, dataset_update.dict(exclude_unset=True))
dataset = await datasets.update_dataset(db, dataset, dataset_update.dict(exclude_unset=True))

await telemetry_client.track_crud_dataset(action="update", dataset=dataset)

return dataset
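
The handlers above call `track_crud_dataset` and `track_crud_dataset_setting`, but the client itself is not part of this file's diff. The following is a minimal sketch of what such a client could look like, mainly to show how many events the list endpoints emit: one "read" per item plus one "list". Only the method names and keyword arguments come from the diff. The class name `SketchTelemetryClient`, the topic format, the payload keys, and the injected `send` callable are assumptions made for illustration.

```python
import asyncio
from typing import Any, Callable, List, Optional


class SketchTelemetryClient:
    """Hypothetical stand-in for TelemetryClient; only the call shapes come from the diff."""

    def __init__(self, send: Callable[[str, dict], None]) -> None:
        self._send = send

    async def track_crud_dataset(
        self, action: str, dataset: Any = None, count: Optional[int] = None
    ) -> None:
        data = {"action": action}
        if count is not None:
            data["count"] = count
        # Assumed topic format: "dataset/<action>".
        self._send(f"dataset/{action}", data)

    async def track_crud_dataset_setting(
        self,
        action: str,
        setting_name: str,
        dataset: Any = None,
        setting: Any = None,
        count: Optional[int] = None,
    ) -> None:
        data = {"action": action, "setting_name": setting_name}
        if count is not None:
            data["count"] = count
        # Assumed topic format: "dataset/<setting_name>/<action>".
        self._send(f"dataset/{setting_name}/{action}", data)


async def list_fields(client: SketchTelemetryClient, fields: List[Any]) -> None:
    # Mirrors the handler pattern: one "read" event per item, then one "list" event.
    for field in fields:
        await client.track_crud_dataset_setting(action="read", setting_name="fields", setting=field)
    await client.track_crud_dataset_setting(action="list", setting_name="fields", count=len(fields))


events: List[str] = []
client = SketchTelemetryClient(lambda topic, data: events.append(topic))
asyncio.run(list_fields(client, ["text", "label", "notes"]))
print(len(events))  # 4: three "read" events plus one "list" event
```

Because the per-item loop scales with the size of the listing, the number of telemetry calls per request grows linearly with the number of fields, vector settings, or metadata properties returned.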