Add REGEXP_REPLACE function. #14460

Merged 7 commits on Jun 29, 2023
1 change: 1 addition & 0 deletions docs/querying/math-expr.md
@@ -84,6 +84,7 @@ The following built-in functions are available.
|parse_long|parse_long(string[, radix]) parses a string as a long with the given radix, or 10 (decimal) if a radix is not provided.|
|regexp_extract|regexp_extract(expr, pattern[, index]) applies a regular expression pattern and extracts a capture group index, or null if there is no match. If index is unspecified or zero, returns the substring that matched the pattern. The pattern may match anywhere inside `expr`; if you want to match the entire string instead, use the `^` and `$` markers at the start and end of your pattern.|
|regexp_like|regexp_like(expr, pattern) returns whether `expr` matches regular expression `pattern`. The pattern may match anywhere inside `expr`; if you want to match the entire string instead, use the `^` and `$` markers at the start and end of your pattern. |
|regexp_replace|regexp_replace(expr, pattern, replacement) replaces all instances of a regular expression pattern with a given replacement string. The pattern may match anywhere inside `expr`; if you want to match the entire string instead, use the `^` and `$` markers at the start and end of your pattern.|
|contains_string|contains_string(expr, string) returns whether `expr` contains `string` as a substring. This method is case-sensitive.|
|icontains_string|icontains_string(expr, string) returns whether `expr` contains `string` as a substring. This method is case-insensitive.|
|replace|replace(expr, pattern, replacement) replaces pattern with replacement|
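
The anchoring note in the `regexp_replace` row can be illustrated with plain `java.util.regex`, which the Java implementation in this PR calls into (`Matcher.replaceAll`). This is a minimal sketch rather than Druid code: the class `AnchoredReplaceSketch` and its `regexpReplace` helper are hypothetical names introduced only for illustration.

```java
import java.util.regex.Pattern;

final class AnchoredReplaceSketch
{
  // Hypothetical stand-in for regexp_replace; plain java.util.regex, not Druid code.
  static String regexpReplace(String expr, String pattern, String replacement)
  {
    return Pattern.compile(pattern).matcher(expr).replaceAll(replacement);
  }

  public static void main(String[] args)
  {
    // Unanchored: every occurrence anywhere in the input is replaced.
    System.out.println(regexpReplace("foo foobar", "foo", "x"));   // x xbar

    // Anchored: the pattern must span the whole input, so nothing matches here.
    System.out.println(regexpReplace("foo foobar", "^foo$", "x")); // foo foobar

    // Anchored, and the whole input is exactly "foo".
    System.out.println(regexpReplace("foo", "^foo$", "x"));        // x
  }
}
```

When nothing matches, the input comes back unchanged rather than as null.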
9 changes: 9 additions & 0 deletions docs/querying/sql-functions.md
@@ -1141,6 +1141,15 @@ Applies a regular expression to the string expression and returns the _n_th matc

Returns true or false signifying whether the regular expression finds a match in the string expression.

## REGEXP_REPLACE

`REGEXP_REPLACE(<CHARACTER>, <CHARACTER>, <CHARACTER>)`

**Function type:** [Scalar, string](sql-scalar.md#string-functions)

Replaces all occurrences of a regular expression in a string expression with a replacement string. The replacement
string may refer to capture groups using `$1`, `$2`, etc.
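
Because the implementation in this PR delegates to `java.util.regex.Matcher.replaceAll`, the `$1`, `$2` references follow Java's replacement-string rules. The sketch below shows those rules at the Java level only; the class `CaptureGroupReplaceSketch` and its helper are hypothetical, and how a SQL string literal is escaped before it reaches this layer is not covered here.

```java
import java.util.regex.Pattern;

final class CaptureGroupReplaceSketch
{
  // Hypothetical stand-in for REGEXP_REPLACE; plain java.util.regex, not Druid code.
  static String regexpReplace(String expr, String pattern, String replacement)
  {
    return Pattern.compile(pattern).matcher(expr).replaceAll(replacement);
  }

  public static void main(String[] args)
  {
    // Reorder the three captured groups: year-month-day becomes day/month/year.
    System.out.println(regexpReplace("2023-06-29", "(\\d+)-(\\d+)-(\\d+)", "$3/$2/$1")); // 29/06/2023

    // In Java's replacement syntax a literal dollar sign must be escaped with a backslash.
    System.out.println(regexpReplace("cost: 5", "(\\d+)", "\\$$1")); // cost: $5
  }
}
```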

## REPEAT

`REPEAT(<CHARACTER>, [<INTEGER>])`
1 change: 1 addition & 0 deletions docs/querying/sql-scalar.md
@@ -103,6 +103,7 @@ String functions accept strings, and return a type appropriate to the function.
|`POSITION(needle IN haystack [FROM fromIndex])`|Returns the index of `needle` within `haystack`, with indexes starting from 1. The search will begin at `fromIndex`, or 1 if `fromIndex` is not specified. If `needle` is not found, returns 0.|
|`REGEXP_EXTRACT(expr, pattern, [index])`|Apply regular expression `pattern` to `expr` and extract a capture group, or `NULL` if there is no match. If index is unspecified or zero, returns the first substring that matched the pattern. The pattern may match anywhere inside `expr`; if you want to match the entire string instead, use the `^` and `$` markers at the start and end of your pattern. Note: when `druid.generic.useDefaultValueForNull = true`, it is not possible to differentiate an empty-string match from a non-match (both will return `NULL`).|
|`REGEXP_LIKE(expr, pattern)`|Returns whether `expr` matches regular expression `pattern`. The pattern may match anywhere inside `expr`; if you want to match the entire string instead, use the `^` and `$` markers at the start and end of your pattern. Similar to [`LIKE`](sql-operators.md#logical-operators), but uses regexps instead of LIKE patterns. Especially useful in WHERE clauses.|
|`REGEXP_REPLACE(expr, pattern, replacement)`|Replaces all occurrences of regular expression `pattern` within `expr` with `replacement`. The replacement string may refer to capture groups using `$1`, `$2`, etc. The pattern may match anywhere inside `expr`; if you want to match the entire string instead, use the `^` and `$` markers at the start and end of your pattern.|
|`CONTAINS_STRING(expr, str)`|Returns true if the `str` is a substring of `expr`.|
|`ICONTAINS_STRING(expr, str)`|Returns true if the `str` is a substring of `expr`. The match is case-insensitive.|
|`REPLACE(expr, pattern, replacement)`|Replaces pattern with replacement in `expr`, and returns the result.|
@@ -0,0 +1,157 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
*/

package org.apache.druid.query.expression;

import org.apache.druid.common.config.NullHandling;
import org.apache.druid.math.expr.Expr;
import org.apache.druid.math.expr.ExprEval;
import org.apache.druid.math.expr.ExprMacroTable;
import org.apache.druid.math.expr.ExpressionType;

import javax.annotation.Nonnull;
import javax.annotation.Nullable;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexpReplaceExprMacro implements ExprMacroTable.ExprMacro
{
  private static final String FN_NAME = "regexp_replace";

  @Override
  public String name()
  {
    return FN_NAME;
  }

  @Override
  public Expr apply(final List<Expr> args)
  {
    validationHelperCheckArgumentCount(args, 3);

    if (args.stream().skip(1).allMatch(Expr::isLiteral)) {
      return new RegexpReplaceExpr(args);
    } else {
      return new RegexpReplaceDynamicExpr(args);
    }
  }

  abstract class BaseRegexpReplaceExpr extends ExprMacroTable.BaseScalarMacroFunctionExpr
  {
    public BaseRegexpReplaceExpr(final List<Expr> args)
    {
      super(FN_NAME, args);
    }

    @Nullable
    @Override
    public ExpressionType getOutputType(InputBindingInspector inspector)
    {
      return ExpressionType.STRING;
    }

    @Override
    public Expr visit(Shuttle shuttle)
    {
      return shuttle.visit(apply(shuttle.visitAll(args)));
    }
  }

  /**
   * Expr when pattern and replacement are literals.
   */
  class RegexpReplaceExpr extends BaseRegexpReplaceExpr
  {
    private final Expr arg;
    private final Pattern pattern;
    private final String replacement;

    private RegexpReplaceExpr(List<Expr> args)
    {
      super(args);

      final Expr patternExpr = args.get(1);
      final Expr replacementExpr = args.get(2);

      if (!ExprUtils.isStringLiteral(patternExpr)
          && !(patternExpr.isLiteral() && patternExpr.getLiteralValue() == null)) {
        throw validationFailed("pattern must be a string literal");
      }

      if (!ExprUtils.isStringLiteral(replacementExpr)
          && !(replacementExpr.isLiteral() && replacementExpr.getLiteralValue() == null)) {
        throw validationFailed("replacement must be a string literal");
      }

      final String patternString = NullHandling.nullToEmptyIfNeeded((String) patternExpr.getLiteralValue());

      this.arg = args.get(0);
      this.pattern = patternString != null ? Pattern.compile(patternString) : null;

Check failure (Code scanning / CodeQL): Regular expression injection

This regular expression is constructed from a user-provided value (reported for 5 distinct sources).
Review comment (Member): This is a real problem, but I'm not sure of the best way to solve the DoS problem, since we do want users to be able to provide a pattern, and the same thing can happen in other places where regexes are provided by user queries.

      this.replacement = NullHandling.nullToEmptyIfNeeded((String) replacementExpr.getLiteralValue());
    }

    @Nonnull
    @Override
    public ExprEval<?> eval(final ObjectBinding bindings)
    {
      if (pattern == null || replacement == null) {
        return ExprEval.of(null);
      }

      final String s = NullHandling.nullToEmptyIfNeeded(arg.eval(bindings).asString());

      if (s == null) {
        return ExprEval.of(null);
      } else {
        final Matcher matcher = pattern.matcher(s);
        final String retVal = matcher.replaceAll(replacement);
        return ExprEval.of(retVal);
      }
    }
  }

  /**
   * Expr when pattern and replacement are dynamic (not literals).
   */
  class RegexpReplaceDynamicExpr extends BaseRegexpReplaceExpr
  {
    private RegexpReplaceDynamicExpr(List<Expr> args)
    {
      super(args);
    }

    @Nonnull
    @Override
    public ExprEval<?> eval(final ObjectBinding bindings)
    {
      final String s = NullHandling.nullToEmptyIfNeeded(args.get(0).eval(bindings).asString());
      final String pattern = NullHandling.nullToEmptyIfNeeded(args.get(1).eval(bindings).asString());
      final String replacement = NullHandling.nullToEmptyIfNeeded(args.get(2).eval(bindings).asString());

      if (s == null || pattern == null || replacement == null) {
        return ExprEval.of(null);
      } else {
        final Matcher matcher = Pattern.compile(pattern).matcher(s);

Check failure (Code scanning / CodeQL): Regular expression injection

This regular expression is constructed from a user-provided value (reported for 3 distinct sources).
        final String retVal = matcher.replaceAll(replacement);
        return ExprEval.of(retVal);
      }
    }
  }
}
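
The macro picks `RegexpReplaceExpr` when the pattern and replacement are literals and `RegexpReplaceDynamicExpr` otherwise. The practical difference is that the literal path compiles the `Pattern` once in the constructor, while the dynamic path compiles it on every `eval` call. A stripped-down sketch of that split in plain Java follows; the class and method names are hypothetical, and Druid's `NullHandling` conversions are deliberately left out.

```java
import java.util.function.UnaryOperator;
import java.util.regex.Pattern;

final class PrecompiledReplaceSketch
{
  // Literal case: compile the pattern once and reuse it for every input value.
  static UnaryOperator<String> literal(String pattern, String replacement)
  {
    final Pattern compiled = Pattern.compile(pattern);
    return s -> s == null ? null : compiled.matcher(s).replaceAll(replacement);
  }

  // Dynamic case: pattern and replacement arrive with each value, so compile each time.
  static String dynamic(String s, String pattern, String replacement)
  {
    if (s == null || pattern == null || replacement == null) {
      return null;
    }
    return Pattern.compile(pattern).matcher(s).replaceAll(replacement);
  }

  public static void main(String[] args)
  {
    final UnaryOperator<String> maskDigits = literal("\\d+", "#");
    System.out.println(maskDigits.apply("a1b22"));        // a#b#
    System.out.println(dynamic("a1b22", "[a-z]", "_"));   // _1_22
  }
}
```

Compiling once matters most when the same expression is evaluated over many rows, which is the common case for a literal pattern in a query.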
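
On the CodeQL finding and the DoS concern raised in review: one commonly described mitigation is to bound how long a match may run by wrapping the input in a `CharSequence` that checks a deadline on each character read. This PR does not implement this, and it is not a Druid API. Every name in the sketch below (`DeadlineRegexSketch`, `DeadlineCharSequence`, `RegexTimeoutException`, `replaceAllWithTimeout`) is hypothetical, and the per-character clock check adds overhead that would need measuring before real use.

```java
import java.util.regex.Pattern;

final class DeadlineRegexSketch
{
  /** Thrown when matching runs past its deadline. Hypothetical type for this sketch. */
  static final class RegexTimeoutException extends RuntimeException
  {
    RegexTimeoutException()
    {
      super("regex evaluation exceeded its time budget");
    }
  }

  /** CharSequence wrapper that checks a deadline on every character read. */
  static final class DeadlineCharSequence implements CharSequence
  {
    private final CharSequence inner;
    private final long deadlineNanos;

    DeadlineCharSequence(CharSequence inner, long deadlineNanos)
    {
      this.inner = inner;
      this.deadlineNanos = deadlineNanos;
    }

    @Override
    public char charAt(int index)
    {
      // The regex engine reads its input through charAt, so backtracking loops pass through here.
      if (System.nanoTime() - deadlineNanos >= 0) {
        throw new RegexTimeoutException();
      }
      return inner.charAt(index);
    }

    @Override
    public int length()
    {
      return inner.length();
    }

    @Override
    public CharSequence subSequence(int start, int end)
    {
      return new DeadlineCharSequence(inner.subSequence(start, end), deadlineNanos);
    }

    @Override
    public String toString()
    {
      return inner.toString();
    }
  }

  static String replaceAllWithTimeout(String input, String pattern, String replacement, long timeoutNanos)
  {
    final long deadline = System.nanoTime() + timeoutNanos;
    return Pattern.compile(pattern)
                  .matcher(new DeadlineCharSequence(input, deadline))
                  .replaceAll(replacement);
  }

  public static void main(String[] args)
  {
    // Ten-second budget, far more than this simple pattern needs.
    System.out.println(replaceAllWithTimeout("a1b22", "\\d+", "#", 10_000_000_000L));
  }
}
```

This only bounds evaluation time. It does not restrict which patterns a user may submit, so it addresses the DoS angle of the finding rather than the injection label itself.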