Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DRAFT] feat: simplify telemetry #5441

Closed
wants to merge 65 commits into from

Conversation

davidberenstein1957
Copy link
Member

Description

Closes NA

…ent),` instead of direct `_TELEMETRY_CLIENT` import
Copy link

codecov bot commented Aug 29, 2024

Codecov Report

Attention: Patch coverage is 92.20779% with 6 lines in your changes missing coverage. Please review.

Project coverage is 91.30%. Comparing base (62b1c12) to head (d59ff6c).
Report is 93 commits behind head on feat/5204-feature-add-huggingface_hubutilssend_telemetry-to-the-argilla-server.

Files with missing lines Patch % Lines
argilla-server/src/argilla_server/telemetry.py 89.47% 4 Missing ⚠️
argilla-server/src/argilla_server/_app.py 33.33% 2 Missing ⚠️
Additional details and impacted files
@@                                                Coverage Diff                                                 @@
##           feat/5204-feature-add-huggingface_hubutilssend_telemetry-to-the-argilla-server    #5441      +/-   ##
==================================================================================================================
+ Coverage                                                                           91.13%   91.30%   +0.17%     
==================================================================================================================
  Files                                                                                 139      139              
  Lines                                                                                5716     5774      +58     
==================================================================================================================
+ Hits                                                                                 5209     5272      +63     
+ Misses                                                                                507      502       -5     
Flag Coverage Δ
argilla-server 91.30% <92.20%> (+0.17%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@davidberenstein1957 davidberenstein1957 marked this pull request as ready for review August 29, 2024 17:42
if dataset:
for attr in attributes:
for attr in _RELEVANT_ATTRIBUTES + ["distribution"]:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The track_crud_dataset method is used for dataset creation. In that case, no fields, questions, or other setting relationships will be provided, so this code is not necessary. Also, the distribution is not a relationship but a column, so the is_relationship_loaded is not needed. You could track the distribution by accessing it directly. I think this simplifies the code readability

Comment on lines 115 to 121
for attr in _RELEVANT_ATTRIBUTES:
if dataset.is_relationship_loaded(attr):
obtained_attr_list = getattr(dataset, attr)
await self.track_resource_size(
crud_action=crud_action, setting_name=f"dataset/{attr}", count=len(obtained_attr_list)
)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This info won't be available on dataset creation, so I would say to remove it.

This method is used to track the creation, update, and deletion of dataset settings. These
settings include fields, questions, vectors settings, and metadata properties.

Args:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we use Parameters format instead.

Comment on lines +83 to +85
await telemetry_client.track_resource_size(
crud_action="read", setting_name="dataset", count=len(dataset_list)
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think adding the dataset count when listing the datasets for the owner is a bit of a convoluted way of computing the metric. It might make more sense to send this information somewhere else in the system. I think I should rethink these metrics and include them in a second iteration.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, not sure about the need for crud_action="read" here since is no related to a resource crud action (we compute a size)

await telemetry_client.track_crud_user(action="list", user=None, is_oauth=False, count=len(users))
for user in users:
await telemetry_client.track_crud_user(action="read", user=user, is_oauth=False)
await telemetry_client.track_resource_size(crud_action="read", setting_name="user", count=len(users))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly, computing this metric here can be not easy to understand later. Better if we reduce the number of metrics for the first iteration an think about these stats for a second iteration

):
await authorize(current_user, UserPolicy.list_workspaces)

user = await User.get_or_raise(db, user_id)

if user.is_owner:
workspaces = await accounts.list_workspaces(db)
await telemetry_client.track_resource_size(crud_action="read", setting_name="workspace", count=len(workspaces))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same comments here.

return {"type": "vector"}
elif isinstance(setting, dict):
return {"type": f"distribution_{setting['strategy']}_{setting['min_submitted']}"}
raise NotImplementedError("Expected a setting to be processed.")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having this here is a bit scary. We don't want to break endpoint functionally because the telemetry module cannot extract info. Also, in general, we must be sure to not propagate errors when telemetry info.

Comment on lines 99 to 100
if crud_action:
user_agent["request.endpoint.method"] = crud_action
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's the purpose of this attribute? We only are tracking "create" actions, and my previous PR is already tracking endpoint method info.


async def track_data(self, topic: str, user_agent: dict, include_system_info: bool = True, count: int = 1):
library_name = "argilla"
async def track_data(self, topic: str, count: int = None, crud_action: str = None, user_agent: dict = None) -> None:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the track_data is the base method to send telemetry info, I would say to simplify the signature and move the count and the crud_action as part of the data. Also, the name of the user agent reveals implementation details of how the data is sent and is not aligned with the method name.

async def track_data (self, topic, data:dict)
...

await self.track_resource_size(
crud_action=crud_action, setting_name=f"dataset/{attr}", count=len(obtained_attr_list)
)

async def track_crud_dataset_setting(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even if the telemetry library won't send data if disabled, we should prevent making unnecessary calls and data processing when telemetry is disabled. Also, if there is an error in our code, users have no way to disable the whole telemetry tracking.

frascuchon added a commit that referenced this pull request Sep 3, 2024
# Description
<!-- Please include a summary of the changes and the related issue.
Please also include relevant motivation and context. List any
dependencies that are required for this change. -->

This PR adds the track startup method defined in
#5441 and include
perstitent_storaged_enbled info as part of the system info

**Type of change**
<!-- Please delete options that are not relevant. Remember to title the
PR according to the type of change -->

- Improvement (change adding some improvement to an existing
functionality)

**How Has This Been Tested**
<!-- Please add some reference about how your feature has been tested.
-->

**Checklist**
<!-- Please go over the list and make sure you've taken everything into
account -->

- I added relevant documentation
- I followed the style guidelines of this project
- I did a self-review of my code
- I made corresponding changes to the documentation
- I confirm My changes generate no new warnings
- I have added tests that prove my fix is effective or that my feature
works
- I have added relevant notes to the CHANGELOG.md file (See
https://keepachangelog.com/)
@frascuchon frascuchon added this to the v2.2.0 milestone Sep 3, 2024
Base automatically changed from feat/5204-feature-add-huggingface_hubutilssend_telemetry-to-the-argilla-server to develop September 3, 2024 09:13
@frascuchon frascuchon changed the title feat: simplify telemetry [DRAFT] feat: simplify telemetry Sep 9, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants