[Search Pipelines] Add a processor that provides fine-grained control over what queries are allowed #10938

msfroh · 2023-10-26T01:48:04Z

Is your feature request related to a problem? Please describe.
Years ago, I worked on a search service where my team would vend search capabilities to other teams around the company. Since my team would be woken up in the middle of the night if/when people sent very expensive queries that would use all CPU or memory, we built functionality to block any remotely-expensive query by default. We'd just throw an exception saying, "We are not going to let you run that kind of query unless you talk with us and explain your use-case." Based on the particular use-case, we would usually suggest a different, better way to solve the give problem, but sometimes we would allow e.g. a wildcard query on one or more specific fields.

OpenSearch has the search.allow_expensive_queries cluster setting, but that's basically a giant on/off switch that lacks any nuance.

Describe the solution you'd like
I would like to be able to configure what query types (and possibly other features, like aggregations) are allowed, on what fields, and under what circumstances. Ideally, I'd like even more specificity than the safeguards component that my old team built (which was limited to query types or features on individual fields). I'd like to be able to say "Prefix queries are okay on this field, but only if the prefix has length 3 or more".

I think a search pipeline SearchRequestProcessor would be the ideal place to configure something like that.

Essentially, I'm thinking we could define a chain of rules for each query type like:

# Hand-wavy brainstorm idea for defining rules to restrict query/aggregation types.
# Whatever we build will probably look different. Suggestions welcome!

PUT /_search/pipeline/allowed_queries_pipeline
{
  "request_processors": [
    {
      "restrict_queries" : {
        "query_types": {
          "prefix" : [
            {
              "field": "foo",
              "value_length" : ">=3",
              "allow": true
            },
            {
              "field": "bar",
              "value_length" : ">=5",
              "allow": true
            },
            {
              "deny" : true
            }
          ],
          "wildcard": [
            {
              "field" : "low_cardinality_field",
              "allow": true
            },
            { "deny" : true }
          ],
          "range": [
            {
              "field" : "timestamp",
              "range_width" : "> 604800000",
              "deny" : true
            }
          ]
        },
        "aggregation_types": {
          "variable_width_histogram: [
            {
              "num_buckets": "> 5"
              "deny" : true
            },
            { "allow":true}
          ]
        }
    }
  ]
}

We would make use of QueryBuilderVisitor to traverse the query tree. I'm imagining that we would essentially apply the rules sequentially to each visited query -- as soon as an "allow" rule matches, we proceed; if a "deny" rule matches, we fail with a 4xx error explaining what rule was violated.

I like this approach because the "rule language" can start simpler than what is described above and evolve based on need.

Describe alternatives you've considered
We could do something much smaller like turning search.allow_expensive_queries into an index-level setting without much code change. Even reworking it to act per-field wouldn't be a huge burden (though the current "allowExpensiveQuery" checks would need to be updated to check per-field). I don't see how it would be practical to implement something like the property-based checks or checks on aggregations.

Instead of using search pipelines, we could define a whole new API for query restriction rules (maybe in a plugin), but then we would need the full set of CRUD operations on these rules. Since this is logic that we would likely want to run on the coordinator, before fanning out to shards, it feels like search pipelines already provide a pretty good persistence API.

Additional context
Note that a SearchRequestProcessor that implements this functionality could be delivered via a plugin. It doesn't need to be core OpenSearch functionality. (It could theoretically start in a plugin and move into the search-pipeline-common module.)

The text was updated successfully, but these errors were encountered:

besha100 · 2023-11-06T14:22:13Z

Looks Great! Just have some thoughts:
1- can disabling-default-pipeline be disabled for users and the pipelines be enforced on the index template.
2- For the script processor , would it be possible to use terminate_after for the queries to terminate after not certain time but certain metrics like CPU, Memory?

macohen · 2023-11-06T16:50:42Z

Looks Great! Just have some thoughts:

1- can disabling-default-pipeline be disabled for users and the pipelines be enforced on the index template.

That seems like a very sane, necessary request. @msfroh and I briefly (and privately) discussed the idea that we might want to also set up an "invisible" search pipeline to handle all _search requests. In other words, even if a user didn't specify a pipeline, if you use this feature to protect your cluster, ALL search requests would need to go through this pipeline.

But then if a user DID specify a pipeline to use, we would need a way to ensure both this default pipeline and the user specified pipeline ran.

2 - I'll defer to others on this. I believe that Painless itself is sandboxed and may not have access to CPU and Memory, but the JVM does have access to these metrics. I don't know if there's a performance impact when accessing these metrics in a production environment these days...

msfroh · 2023-11-13T22:11:55Z

1- can disabling-default-pipeline be disabled for users and the pipelines be enforced on the index template.

As discussed on #6181 and #11053, I'm really excited about the idea of defining "views" that front e.g. an index pattern, where the view would enforce use of a search pipeline. Then you could grant users access to specific views. I think that would be a nice way to connect these various dots.

msfroh added enhancement Enhancement or improvement to existing feature or request untriaged labels Oct 26, 2023

msfroh mentioned this issue Nov 1, 2023

[Search Pipelines] Modify search pipeline behavior based on authenticated user/role #11053

Open

msfroh mentioned this issue Nov 6, 2023

Data projection with views #6181

Open

msfroh mentioned this issue Nov 13, 2023

Configurable query logging using search pipelines #11188

Open

macohen added Search:Resiliency and removed untriaged labels Nov 28, 2023

peternied mentioned this issue Feb 14, 2024

[Feature Request] Views features wishlist #12322

Closed

5 tasks

msfroh mentioned this issue Feb 20, 2024

Projected Views #11957

Merged

8 tasks

peternied mentioned this issue Feb 21, 2024

[RFC] View feature wishlist #12424

Open

5 tasks

msfroh mentioned this issue Apr 2, 2024

Qsb framework changes #13007

Closed

8 tasks

msfroh mentioned this issue Apr 23, 2024

[Feature Request][RFC] Multi-tenancy as a construct in OpenSearch #13341

Open

ansjcy mentioned this issue May 2, 2024

[META] Multi-tenancy support in OpenSearch #13516

Open

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Search Pipelines] Add a processor that provides fine-grained control over what queries are allowed #10938

[Search Pipelines] Add a processor that provides fine-grained control over what queries are allowed #10938

msfroh commented Oct 26, 2023

besha100 commented Nov 6, 2023

macohen commented Nov 6, 2023

msfroh commented Nov 13, 2023

[Search Pipelines] Add a processor that provides fine-grained control over what queries are allowed #10938

[Search Pipelines] Add a processor that provides fine-grained control over what queries are allowed #10938

Comments

msfroh commented Oct 26, 2023

besha100 commented Nov 6, 2023

macohen commented Nov 6, 2023

msfroh commented Nov 13, 2023