
[PROPOSAL] Obfuscate processor #1952

Closed
dlvenable opened this issue Oct 20, 2022 · 8 comments · Fixed by #2752

@dlvenable
Member

Is your feature request related to a problem? Please describe.

Many pipelines receive sensitive data that needs to be removed or obfuscated. Data Prepper pipeline authors can do this somewhat manually with some of the existing processors like remove_field. However, the data can exist in multiple fields so this can be tedious.

Describe the solution you'd like

I'd like a processor that takes as input a list of fields which contain sensitive data. It uses the values from those fields to obfuscate or remove data from other fields.

processor:
  grok:
    match:
      log: ["%{NOTSPACE:some_field} %{NOTSPACE:name} %{NOTSPACE:email_address} %{GREEDYDATA:more_data}"]
  obfuscate:
    source_fields:
      - name
      - email_address

Give it a log like the following.

"start_login bob bob@example.org the rest of the message includes bob and an email bob@example.org"

Grok will update the event to look like the following.

log : "start_login bob bob@example.org the rest of the message includes bob and an email bob@example.org"
some_field : "start_login"
name: "bob"
email_address: "bob@example.org"
more_data: "the rest of the message includes bob and an email bob@example.org"

Then this event goes into the proposed obfuscate processor. It pulls the strings bob and bob@example.org, then looks for those values in all fields and replaces them. Thus, the output Event would look like the following.

log : "start_login *** *** the rest of the message includes *** and an email ***"
some_field : "start_login"
more_data: "the rest of the message includes *** and an email ***"

This processor could have some options.

  • obfuscation_character - * by default
  • obfuscation_length - 3 by default
  • unobfuscated_length - Leave some characters in place. For example, if this value were 8, then we'd get ***mple.org from example.org. Defaults to 0.
  • retain_source_fields - If set to true, it keeps the name and email_address values unobfuscated. Defaults to false.
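
As a sketch, the options above might combine as follows (the exact YAML shape is illustrative, not final):

processor:
  obfuscate:
    source_fields:
      - name
      - email_address
    obfuscation_character: "*"
    obfuscation_length: 3
    unobfuscated_length: 8      # keep the last 8 characters, e.g. ***mple.org
    retain_source_fields: false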

This processor could also have special substitution rules to mask data based on characters. For example, bob@example.org could be made into ***@*******.***. Perhaps combined with other rules, we could even generate values such as ***@*******.org.
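
One hypothetical shape for such a rule (none of these option names exist yet; they are purely illustrative):

obfuscate:
  source_fields:
    - email_address
  substitution_rules:
    - preserve_characters: ["@", "."]   # hypothetical: keep separators, yielding ***@*******.***
      preserve_suffix_length: 3         # hypothetical: keep the TLD, yielding ***@*******.org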

Describe alternatives you've considered (Optional)

Using existing processors for find/replace.

@dlvenable dlvenable added enhancement New feature or request plugin - processor A plugin to manipulate data in the data prepper pipeline. and removed untriaged labels Nov 3, 2022
@jimishs

jimishs commented May 9, 2023

This is a great idea! In a recent customer conversation, the customer requested a capability to provide such PII filter / masking criteria at ingest time, rather than having an administrator set up access control rules post-index. This would give customers control over what data is allowed to be ingested into their clusters.

@dlvenable
Member Author

Thanks for the feedback @jimishs !

In this proposal, Data Prepper would look at a user-defined field to get the sensitive data. Then it would obfuscate it throughout the event (which eventually becomes an OpenSearch document).

Does this sound like an approach that would help your customer?

@daixba
Contributor

daixba commented May 10, 2023

Do we need to support obfuscation by patterns? Maybe regex patterns for a start.

For example, given the field logs: start_login bob bob@example.org the rest of the message includes bob and an email bob@example.org, instead of adding an extra grok expression, we merge the two steps into one, something similar to

processor:
  ...
  obfuscate:
    source_fields:
      - key: logs
        patterns:
          - <Email Pattern>
          - <...>
...

Two benefits:

  1. Users don't need an extra step to parse the fields before using the obfuscate processor.
  2. The source field may just be regular text without a fixed format.

This is especially useful when we can't predict a grok pattern for the source field. For example, to support both of the log messages below:

log: "start_login bob bob@example.org the rest of the message includes bob and an email bob@example.org"

log: "bob (email: bob@example.org) the rest of the message includes an email bob@example.org and a name bob"

Thanks.

@daixba
Contributor

daixba commented May 10, 2023

And patterns can be optional. By default, the full source field is obfuscated. If patterns are defined, only those patterns are obfuscated.
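
So both entry forms would be valid, as in this sketch:

obfuscate:
  source_fields:
    - key: email                       # no patterns: the whole field value is obfuscated
    - key: log                         # patterns given: only the matching substrings are obfuscated
      patterns:
        - "[A-Za-z0-9+_.-]+@(\\S+)"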

@daixba
Contributor

daixba commented May 11, 2023

I would like to propose the design below for this processor. Thoughts? @dlvenable

Basic Usage

The basic usage of the processor is as below:

pipeline:
  source:
    http:
  processor:
    - obfuscate:
        entries:
          - key: "log"
            pattern: "[A-Za-z0-9+_.-]+@(\\S+)"
            obfuscation_character: "*"
            obfuscation_length: 3
            unobfuscated_length: 0
            new_key: "newlog"
          - key: "email"
  sink:
    - stdout:

Take the input below:

{
  "id": 1,
  "email": "abc@example.com",
  "log": "My name is Bob and my email address is abc@example.com"
}

When run, the processor will parse the message into the following output:

{
  "id": 1,
  "email": "***",
  "log": "My name is Bob and my email address is abc@example.com",
  "newlog": "My name is Bob and my email address is ***"
}

And we can define some common patterns to simplify usage, such as email addresses. Instead of providing
pattern: "[A-Za-z0-9+_.-]+@(\\S+)", users could then use pattern: %{EMAIL_ADDRESS} (the exact format can be
further discussed).
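
Under that idea, the log entry in the basic usage above could shrink to something like the following (the %{...} format is illustrative only):

entries:
  - key: "log"
    pattern: "%{EMAIL_ADDRESS}"
    new_key: "newlog"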

Configuration

Below is the list of configuration options.

  • entries - (required) - A list of entries to obfuscate
    • key - (required) - The key to be obfuscated
    • pattern - (optional) - Regex pattern. Only the parts that match the regex pattern will be obfuscated. If not
      provided, the full field will be obfuscated.
    • obfuscation_character - (optional) - Defaults to "*"
    • obfuscation_length - (optional) - Defaults to 3. The mask will consist of this many obfuscation characters, e.g. '***'
    • unobfuscated_length - (optional) - Defaults to 0. Leave some characters in place. For example, if this value were
      8, then we'd get ***mple.org from example.org.
    • new_key - (optional) - Store the obfuscated value under a new key, leaving the current field unchanged.

FAQ:

Q1: Can this processor auto-detect the sensitive data to be obfuscated?

The answer is no. This processor is essentially a string mutation based on the patterns provided by users. There is no
NLP feature to auto-detect sensitive data in this processor.

Q2: What are the differences between this one and the SubstituteStringProcessor?

This processor provides more options, such as obfuscation character and length, for substituting the string.

Q3: Can this support one entry with multiple patterns?

e.g. I may have a field that contains multiple kinds of data to be obfuscated (like email, IP, etc.)

Supporting this would require too many levels (4) of nested loops in the code, so it's not supported. The workaround is
to use multiple entries for the same key, for example:
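
To mask both emails and IPv4 addresses in the same field, that workaround would look like this (the IP regex here is a simplified illustration):

entries:
  - key: "log"
    pattern: "[A-Za-z0-9+_.-]+@(\\S+)"           # emails
  - key: "log"
    pattern: "\\b(?:\\d{1,3}\\.){3}\\d{1,3}\\b"  # IPv4 addresses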

Q4: Do we need to support grok patterns rather than regex patterns?

Grok patterns are essentially regex patterns, so they may be supported in the future so that we don't have to reinvent
the wheel. But to start, only regex patterns are supported.

And we can define some common patterns to be used, such as email addresses, with grok-like expressions for those
common patterns, such as %{EMAIL_ADDRESS}.

@dlvenable
Member Author

dlvenable commented May 15, 2023

@daixba , Thank you for the proposal here! Overall, this sounds like a great solution. I have a few comments on it.

This is somewhat similar to my original proposal. However, you add a pattern option which obfuscates only the matching parts of the value.

One part of the original proposal was to correct other fields that have the same value. This appears to be absent from your modified proposal.

Let's say we have the following configuration:

pipeline:
  source:
    http:
  processor:
    - obfuscate:
        entries:
          - key: "email"
            obfuscation_character: "*"
            obfuscation_length: 3
            unobfuscated_length: 0
  sink:
    - stdout:

With the same input:

{
  "id": 1,
  "email": "abc@example.com",
  "log": "My name is Bob and my email address is abc@example.com"
}

We can detect that email was indeed the field to obfuscate. It should be able to obfuscate the value in log (and any other field).

Thus resulting in the following output.

{
  "id": 1,
  "email": "***",
  "log": "My name is Bob and my email address is ***"
}

Can we retain the approach of obfuscating all fields? Thoughts on this?


One other thought is to remove the entries. I'm not sure there is much to gain from having one processor with multiple entries rather than multiple processors. And I think the more complicated YAML is difficult for pipeline authors to get started with.

In the example you gave, we could have the following instead.

pipeline:
  source:
    http:
  processor:
    - obfuscate:
        key: "log"
        pattern: "[A-Za-z0-9+_.-]+@(\\S+)"
        obfuscation_character: "*"
        obfuscation_length: 3
        unobfuscated_length: 0
        new_key: "newlog"
    - obfuscate:
        key: "email"
  sink:
    - stdout:

@daixba
Contributor

daixba commented May 16, 2023

Thanks, David, for the feedback. Yes, this is an implementation of your proposed design.

Regarding your comments.

  1. The approach of obfuscating all fields

I am not sure this is a common case. Obfuscating all fields by default is helpful when users don't know exactly which fields may contain sensitive data. But I think we can add this as an enhancement when users ask for it; then we can ask for more clarification on how it should be implemented.

Two reasons I believe we can leave this for now:

  • Performance: we would have to iterate over all keys and obfuscate them one by one, which will surely slow down the pipeline.
  • Conflict with retaining source fields: I added new_key as an option so that users can leave the source key unchanged. If we support both obfuscating all fields and retaining source fields, the result may be confusingly inconsistent; as in your example, some emails would be obfuscated and some would not.

For now, if users need such a feature, they can add multiple processor entries in the YAML covering all (potential) fields.
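
For example, using the single-entry form from the previous comment, a sketch covering both fields from the earlier input:

processor:
  - obfuscate:
      key: "email"
  - obfuscate:
      key: "log"
      pattern: "[A-Za-z0-9+_.-]+@(\\S+)"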

  2. Removal of entries

I agree with this; it's simpler in terms of both usage and implementation.

@daixba daixba mentioned this issue May 16, 2023
@dlvenable dlvenable added this to the v2.3 milestone May 16, 2023
@daixba daixba mentioned this issue May 25, 2023
@daixba
Contributor

daixba commented May 25, 2023

It's helpful to support multiple patterns for the same field, and this processor may be extended with more actions (apart from mask) in the future. Hence, some changes are proposed to the configuration options.

Example usage is as below:

pipeline:
  source:
    http:
  processor:
    - obfuscate:
        source: "log"
        target: "new_log"
        patterns:
          - "[A-Za-z0-9+_.-]+@([\\w-]+\\.)+[\\w-]{2,4}"
        action:
          mask:
            mask_character: "#"
            mask_character_length: 6
    - obfuscate:
        source: "phone"
  sink:
    - stdout:
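
As a sketch of the expected behavior, assuming the phone entry falls back to the earlier defaults ("*" and length 3), an input like

{
  "phone": "555-0123",
  "log": "My name is Bob and my email address is abc@example.com"
}

would come out as

{
  "phone": "***",
  "log": "My name is Bob and my email address is abc@example.com",
  "new_log": "My name is Bob and my email address is ######"
}

since target keeps the source field unchanged and mask_character_length: 6 produces ###### (the phone value here is illustrative).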
