Update data_classification_policy.rst --- copy edits (grammar, consistency, clarity) (#1139)

Signed-off-by: welisheva22 <welisheva22@gmail.com>
Co-authored-by: Yoav Katz <68273864+yoavkatz@users.noreply.github.com>
2 people authored and csrajmohan committed Aug 29, 2024
1 parent 8776623 commit 27fb8f0
Showing 1 changed file with 10 additions and 10 deletions.
20 changes: 10 additions & 10 deletions docs/docs/data_classification_policy.rst
@@ -12,11 +12,11 @@ The section discusses how to properly handle sensitive data in Unitxt in order t
proprietary/confidential/personal data to unauthorized services or 3rd parties. For example, sending sensitive
data for inference by an external API in LLM as Judge metric.

-The problem is exacerbated since the person who owns the data and uses the metric in their card,
-may not know what 3rd services are used by internally by the metric.
+The problem is exacerbated since the person who owns the data and uses the metric in their card
+may not know what 3rd party services are used internally by the metric.

-To address this Unitxt allows the data owner to specify the data classification of their data, and require that
-any metric (or other component) that processes the data, must be explicitly allowed to process data with this classification.
+To address this, Unitxt allows the data owner to specify the data classification of their data, and similarly it requires that
+any metric (or other component) that processes the data must be explicitly allowed to process data with this classification.


Data classification policy
@@ -28,14 +28,14 @@ You can define your own data classification identifiers.

Each component that processes data in Unitxt ( operators, metrics, inference engines, etc.) also has
a parameter called `data_classification_policy`. This parameter determines which kinds of data
-it can process. The parameter is also a list of string identifiers, which are names of allowed data classification.
+it can process. The parameter is also a list of string identifiers, each of which is a name of allowed data classification.

Before processing the data, the component verifies that the `data_classification_policy` of the data meets its `data_classification_policy`.
If the policies for a component include the classification of the data, then the data may be further processed. Otherwise, an error will be raised.
-For example, a LLM as judge that calls an external api, may set `data_classification_policy` to `['public']`.
+For example, an LLM as judge that calls an external api may set `data_classification_policy` to `['public']`.
If data marked [`confidential`] is passed to the metric, it will not process the data and fail.

-If the data has multiple `data_classification_policy`s then the component must be allowed to handle all of them.
+If the data has multiple values under `data_classification_policy` then the component must be allowed to handle all of them.
If the `data_classification_policy` is not set, the component can handle all data.

It is possible to override the `data_classification_policy` of a component with an environment variable. See below.
@@ -45,7 +45,7 @@ Adding `data_classification_policy` for data

Data classification information is added to streams of data by the use of Unitxt loaders.
Existing loaders have default data classification policies. For example, LoadHF sets the policy to `['public']` for datasets
-downloaded from the Huggingface and `['proprietary']` for datasets loaded from local files. You can override this by setting
+downloaded from the HuggingFace and `['proprietary']` for datasets loaded from local files. You can override this by setting
the `data_classification_policy` parameter of the loader.

The data classification value is added as an additional field to all instances within a stream.
@@ -105,8 +105,8 @@ Example:
1. **Overriding default policy during environment variable **:

-You can override the data classification of artifacts that was saved in the catalog, by setting the the `UNITXT_DATA_CLASSIFICATION_POLICY` env variable accordingly.
-It should be of string representation of type `Dict[str, List[str]]`, where a key is a name of a given artifact, and a corresponding value of allowed data classification. For example:
+You can override the data classification of artifacts that was saved in the catalog by setting the `UNITXT_DATA_CLASSIFICATION_POLICY` env variable accordingly.
+It should be a string representation of type `Dict[str, List[str]]`, where a key is a name of a given artifact, and a corresponding value is the allowed data classification. For example:

.. code-block:: bash

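The verification rule described in the changed documentation (a component may process data only if its policy covers every classification on the data, and a component with no policy accepts all data) can be sketched in plain Python. This is an illustrative sketch only, not Unitxt's implementation: the function names `may_process` and `verify_policy` are hypothetical, and the treatment of data that carries no classification is an assumption, since the text does not state it.

```python
from typing import List, Optional


def may_process(data_policy: Optional[List[str]],
                component_policy: Optional[List[str]]) -> bool:
    """Hypothetical helper: decide whether a component may process the data."""
    # Stated in the docs: if the component's policy is not set,
    # the component can handle all data.
    if component_policy is None:
        return True
    # Assumption (not stated in the docs): data without any
    # classification is treated as unrestricted.
    if not data_policy:
        return True
    # Stated in the docs: if the data has multiple classifications,
    # the component must be allowed to handle all of them.
    return all(label in component_policy for label in data_policy)


def verify_policy(data_policy: Optional[List[str]],
                  component_policy: Optional[List[str]],
                  component_name: str = "component") -> None:
    """Hypothetical helper: raise an error when processing is not allowed."""
    if not may_process(data_policy, component_policy):
        raise ValueError(
            f"{component_name} allows {component_policy} "
            f"but the data is classified as {data_policy}"
        )
```

Under these assumptions, a judge metric restricted to `['public']` would accept data marked `['public']` and reject data marked `['confidential']` or `['public', 'confidential']`.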