[PROPOSAL] Obfuscate processor #1952
Comments
This is a great idea! In a recent conversation, a customer asked for the ability to provide a PII filter or masking criteria at ingest time, rather than having an administrator set up access control rules after indexing. This would give customers control over what data is allowed to be ingested into their clusters. |
Thanks for the feedback, @jimishs! In this proposal, Data Prepper would look at a user-defined field to get the sensitive data. Then it would obfuscate that data throughout the event (which eventually becomes an OpenSearch document). Does this sound like an approach that would help your customer? |
Do we need to support obfuscation by patterns? Maybe regex patterns to start with. For example, given the field
Two benefits:
Especially if we can't predict a grok pattern for the source field. For example, to support both of the log messages below.
Thanks. |
Patterns can be optional. By default, the full source field is obfuscated. If patterns are defined, only the text matching those patterns is obfuscated. |
I would like to propose the design below for this processor. Thoughts? @dlvenable
Basic Usage
The basic usage of the processor is shown in the pipeline below:
source:
  http:
processor:
  - obfuscate:
      entries:
        - key: "log"
          pattern: "[A-Za-z0-9+_.-]+@(\\S+)"
          obfuscation_character: "*"
          obfuscation_length: 3
          unobfuscated_length: 0
          new_key: "newlog"
        - key: "email"
sink:
  - stdout:
Take the input below:
{
  "id": 1,
  "email": "abc@example.com",
  "log": "My name is Bob and my email address is abc@example.com"
}
When run, the processor will transform the event into the following output:
{
  "id": 1,
  "email": "***",
  "log": "My name is Bob and my email address is abc@example.com",
  "newlog": "My name is Bob and my email address is ***"
}
We can also define some common patterns to simplify usage, such as email address, so instead of
Configuration
Below is the list of configuration options.
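To make the basic usage example in this comment concrete, here is a minimal Python sketch of the described behavior. It is not Data Prepper code: the function name `obfuscate_entry` and its parameters are hypothetical and only mirror the proposed option names. It assumes that an entry without a pattern masks the whole field, and it leaves out `unobfuscated_length` because the example sets it to 0.

```python
import re


def obfuscate_entry(event, key, pattern=None, obfuscation_character="*",
                    obfuscation_length=3, new_key=None):
    """Mask pattern matches in event[key]; mask the whole value if no pattern is given."""
    mask = obfuscation_character * obfuscation_length
    value = event.get(key)
    if not isinstance(value, str):
        return event  # nothing to do for missing or non-string fields
    if pattern:
        # A function replacement avoids any escape handling in the mask string.
        masked = re.sub(pattern, lambda _match: mask, value)
    else:
        masked = mask
    result = dict(event)
    # Write to new_key when given (assumption: the source key is then left untouched).
    result[new_key or key] = masked
    return result


event = {
    "id": 1,
    "email": "abc@example.com",
    "log": "My name is Bob and my email address is abc@example.com",
}
event = obfuscate_entry(event, "log", pattern=r"[A-Za-z0-9+_.-]+@(\S+)", new_key="newlog")
event = obfuscate_entry(event, "email")
print(event)
```

With the sample input from this comment, the sketch leaves `log` unchanged, writes the masked copy to `newlog`, and replaces `email` with `***`.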
FAQ:
Q1: Can this processor auto-detect the sensitive data to be obfuscated?
No. This processor essentially mutates a string based on the pattern provided by the user. There is no
Q2: What are the differences between this processor and the SubstituteStringProcessor?
This processor provides more options for substituting the string, such as the obfuscation character and length.
Q3: Can this support one entry with multiple patterns? For example, I may have a field that contains multiple patterns to be obfuscated (like email, IP, etc.).
Because that would mean too many levels (4) of loops in the code, it is not supported. The workaround is to use multiple
Q4: Do we need to support Grok patterns rather than regex patterns?
Grok patterns are essentially regex patterns. They may be supported in the future so that we don't have to reinvent the
We can also define some common patterns to be used, such as email address, and we can use Grok-like expressions for those |
@daixba, thank you for the proposal here! Overall, this sounds like a great solution. I have a few comments on it.
This is somewhat similar to my original proposal. However, you add the
One part of the original proposal was to correct other fields that have the same value. This appears to be absent in your modified proposal. Let's say we have the following configuration:
With the same input:
We can detect that
Thus resulting in the following output.
Can we retain the approach of obfuscating all fields? Thoughts on this?
One other thought is to remove the
In the example you gave, we could have the following instead.
|
Thanks, David, for the feedback. Yes, this is an implementation of your proposed design. Regarding your comments:
I am not sure whether obfuscating all fields by default is a common case. It is helpful when users don't know exactly which fields may contain sensitive data. But I think we can add this as an enhancement when users ask for it, and then ask for more clarification to see how it should be implemented. Two reasons I believe we can leave this for now:
For now, if users need such a feature, they can add multiple processor entries in the YAML for all (potential) fields.
I agree with this; it's simpler in terms of both usage and implementation. |
It's helpful to support multiple patterns for the same field, and this processor may be extended with more actions (apart from mask) in the future. Hence, I propose some changes to the configuration options. Example usage is shown in the pipeline below:
source:
  http:
processor:
  - obfuscate:
      source: "log"
      target: "new_log"
      patterns:
        - "[A-Za-z0-9+_.-]+@([\\w-]+\\.)+[\\w-]{2,4}"
      action:
        mask:
          mask_character: "#"
          mask_character_length: 6
  - obfuscate:
      source: "phone"
sink:
  - stdout:
|
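The revised configuration in the last comment above (a list of patterns plus a pluggable action) could behave roughly like the following Python sketch. The names `mask_action` and `obfuscate` are hypothetical, the sample event is invented for illustration, and the default mask of three `*` characters is an assumption carried over from the original proposal's defaults.

```python
import re


def mask_action(mask_character="*", mask_character_length=3):
    """Build an action that replaces any matched text with a fixed-length mask."""
    mask = mask_character * mask_character_length
    return lambda _matched_text: mask


def obfuscate(event, source, patterns=None, target=None, action=None):
    """Apply `action` to every pattern match in event[source], or to the whole value."""
    action = action or mask_action()  # assumed default: "***"
    value = event.get(source)
    if not isinstance(value, str):
        return event
    if patterns:
        for pattern in patterns:
            value = re.sub(pattern, lambda m: action(m.group(0)), value)
    else:
        value = action(value)
    return {**event, (target or source): value}


# Hypothetical sample event; only the field names come from the configuration.
event = {"log": "Contact abc@example.com for details", "phone": "555-0100"}
event = obfuscate(
    event,
    source="log",
    target="new_log",
    patterns=[r"[A-Za-z0-9+_.-]+@([\w-]+\.)+[\w-]{2,4}"],
    action=mask_action(mask_character="#", mask_character_length=6),
)
event = obfuscate(event, source="phone")
print(event)
```

Modelling the action as a callable is one way to leave room for future actions other than `mask` without changing how patterns are applied.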
Is your feature request related to a problem? Please describe.
Many pipelines receive sensitive data that needs to be removed or obfuscated. Data Prepper pipeline authors can do this somewhat manually with some of the existing processors like remove_field. However, the data can exist in multiple fields so this can be tedious.
Describe the solution you'd like
I'd like a processor that takes as input a list of fields which contain sensitive data. It uses the values from those fields to obfuscate or remove data from other fields.
Give it a log like the following.
Grok will update the event to look like the following.
Then this event goes into the proposed obfuscate processor. It pulls the strings bob and bob@example.org. Then it looks for them in all fields, replacing those values. Thus, the output Event would look like the following.
This processor could have some options.
obfuscation_character - * by default
obfuscation_length - 3 by default
unobfuscated_length - Leave some characters in place. For example, if this value were 8, then we'd get ***mple.org from example.org. Default is 0.
retain_source_fields - If set to true, it would keep the name and email_address values, unobfuscated. By default is false.
This processor could also have special substitution rules to mask data based on characters. For example, bob@example.org could be made into ***@*******.***. Perhaps combined with other rules we could even generate values such as ***@*******.org.
Describe alternatives you've considered (Optional)
Using existing processors for find/replace.
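Returning to the character-based substitution rules suggested in the solution description, a minimal Python sketch of one possible rule is shown here. The function `mask_characters` and its `keep_suffix` parameter are hypothetical; the sketch assumes that only ASCII letters and digits are masked, so separators such as @ and . stay visible.

```python
import re


def mask_characters(value, mask_character="*", keep_suffix=""):
    """Mask each ASCII letter or digit, optionally leaving a trailing suffix readable."""
    head = value
    tail = ""
    if keep_suffix and value.endswith(keep_suffix):
        head = value[: len(value) - len(keep_suffix)]
        tail = keep_suffix
    return re.sub(r"[A-Za-z0-9]", lambda _match: mask_character, head) + tail


print(mask_characters("bob@example.org"))
print(mask_characters("bob@example.org", keep_suffix="org"))
```

The first call yields the fully masked form from the proposal, and the second keeps the top-level domain readable, matching the two example values given in the solution description.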