Add REGEXP_REPLACE function. #14460

Merged
merged 7 commits into from
Jun 29, 2023
1 change: 1 addition & 0 deletions docs/querying/math-expr.md
@@ -84,6 +84,7 @@ The following built-in functions are available.
|parse_long|parse_long(string[, radix]) parses a string as a long with the given radix, or 10 (decimal) if a radix is not provided.|
|regexp_extract|regexp_extract(expr, pattern[, index]) applies a regular expression pattern and extracts a capture group index, or null if there is no match. If index is unspecified or zero, returns the substring that matched the pattern. The pattern may match anywhere inside `expr`; if you want to match the entire string instead, use the `^` and `$` markers at the start and end of your pattern.|
|regexp_like|regexp_like(expr, pattern) returns whether `expr` matches regular expression `pattern`. The pattern may match anywhere inside `expr`; if you want to match the entire string instead, use the `^` and `$` markers at the start and end of your pattern. |
|regexp_replace|regexp_replace(expr, pattern, replacement) replaces all instances of a regular expression pattern with a given replacement string. The pattern may match anywhere inside `expr`; if you want to match the entire string instead, use the `^` and `$` markers at the start and end of your pattern.|
|contains_string|contains_string(expr, string) returns whether `expr` contains `string` as a substring. This method is case-sensitive.|
|icontains_string|icontains_string(expr, string) returns whether `expr` contains `string` as a substring. This method is case-insensitive.|
|replace|replace(expr, pattern, replacement) replaces pattern with replacement|
9 changes: 9 additions & 0 deletions docs/querying/sql-functions.md
@@ -1141,6 +1141,15 @@ Applies a regular expression to the string expression and returns the _n_th matc

Returns true or false signifying whether the regular expression finds a match in the string expression.

## REGEXP_REPLACE

`REGEXP_REPLACE(<CHARACTER>, <CHARACTER>, <CHARACTER>)`

**Function type:** [Scalar, string](sql-scalar.md#string-functions)

Replaces all occurrences of a regular expression in a string expression with a replacement string. The replacement
string may refer to capture groups using `$1`, `$2`, etc.
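
As an illustration of the capture-group syntax described above, here is a minimal sketch in plain Java. It assumes the SQL function behaves like `java.util.regex.Matcher.replaceAll`, which is what the expression macro in this change calls. The class and method names (`CaptureGroupReplaceSketch`, `regexpReplace`) are hypothetical and not part of Druid.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of Druid.
class CaptureGroupReplaceSketch
{
  // Compile the pattern, match it against the input, and replace every match.
  static String regexpReplace(String input, String pattern, String replacement)
  {
    return Pattern.compile(pattern).matcher(input).replaceAll(replacement);
  }

  public static void main(String[] args)
  {
    // $1, $2, and $3 refer to the first, second, and third capture groups.
    System.out.println(regexpReplace("2023-06-29", "(\\d+)-(\\d+)-(\\d+)", "$3/$2/$1"));
    // prints 29/06/2023

    // The pattern may match anywhere; ^ anchors it to the start of the string.
    System.out.println(regexpReplace("foo foo", "^foo", "bar"));
    // prints bar foo

    // A literal dollar sign in the replacement must be escaped.
    System.out.println(regexpReplace("cost 5", "(\\d+)", Matcher.quoteReplacement("$") + "$1"));
    // prints cost $5
  }
}
```

Because `$` is special in the replacement string, a replacement that should contain a literal dollar sign needs escaping, as the last call shows.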

## REPEAT

`REPEAT(<CHARACTER>, [<INTEGER>])`
1 change: 1 addition & 0 deletions docs/querying/sql-scalar.md
@@ -103,6 +103,7 @@ String functions accept strings, and return a type appropriate to the function.
|`POSITION(needle IN haystack [FROM fromIndex])`|Returns the index of `needle` within `haystack`, with indexes starting from 1. The search will begin at `fromIndex`, or 1 if `fromIndex` is not specified. If `needle` is not found, returns 0.|
|`REGEXP_EXTRACT(expr, pattern, [index])`|Apply regular expression `pattern` to `expr` and extract a capture group, or `NULL` if there is no match. If index is unspecified or zero, returns the first substring that matched the pattern. The pattern may match anywhere inside `expr`; if you want to match the entire string instead, use the `^` and `$` markers at the start and end of your pattern. Note: when `druid.generic.useDefaultValueForNull = true`, it is not possible to differentiate an empty-string match from a non-match (both will return `NULL`).|
|`REGEXP_LIKE(expr, pattern)`|Returns whether `expr` matches regular expression `pattern`. The pattern may match anywhere inside `expr`; if you want to match the entire string instead, use the `^` and `$` markers at the start and end of your pattern. Similar to [`LIKE`](sql-operators.md#logical-operators), but uses regexps instead of LIKE patterns. Especially useful in WHERE clauses.|
|`REGEXP_REPLACE(expr, pattern, replacement)`|Replaces all occurrences of regular expression `pattern` within `expr` with `replacement`. The replacement string may refer to capture groups using `$1`, `$2`, etc. The pattern may match anywhere inside `expr`; if you want to match the entire string instead, use the `^` and `$` markers at the start and end of your pattern.|
|`CONTAINS_STRING(expr, str)`|Returns true if the `str` is a substring of `expr`.|
|`ICONTAINS_STRING(expr, str)`|Returns true if the `str` is a substring of `expr`. The match is case-insensitive.|
|`REPLACE(expr, pattern, replacement)`|Replaces pattern with replacement in `expr`, and returns the result.|
@@ -0,0 +1,152 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
*/

package org.apache.druid.query.expression;

import org.apache.druid.common.config.NullHandling;
import org.apache.druid.math.expr.Expr;
import org.apache.druid.math.expr.ExprEval;
import org.apache.druid.math.expr.ExprMacroTable;
import org.apache.druid.math.expr.ExpressionType;

import javax.annotation.Nonnull;
import javax.annotation.Nullable;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexpReplaceExprMacro implements ExprMacroTable.ExprMacro
{
private static final String FN_NAME = "regexp_replace";

@Override
public String name()
{
return FN_NAME;
}

@Override
public Expr apply(final List<Expr> args)
{
validationHelperCheckArgumentCount(args, 3);

if (args.stream().skip(1).allMatch(Expr::isLiteral)) {
return new RegexpReplaceExpr(args);
} else {
return new RegexpReplaceDynamicExpr(args);
}
}

abstract class BaseRegexpReplaceExpr extends ExprMacroTable.BaseScalarMacroFunctionExpr
{
public BaseRegexpReplaceExpr(final List<Expr> args)
{
super(FN_NAME, args);
}

@Nullable
@Override
public ExpressionType getOutputType(InputBindingInspector inspector)
{
return ExpressionType.STRING;
}

@Override
public Expr visit(Shuttle shuttle)
{
return shuttle.visit(apply(shuttle.visitAll(args)));
}
}

/**
* Expr when pattern and replacement are literals.
*/
class RegexpReplaceExpr extends BaseRegexpReplaceExpr
{
private final Expr arg;
private final Pattern pattern;
private final String replacement;

private RegexpReplaceExpr(List<Expr> args)
{
super(args);

final Expr patternExpr = args.get(1);
final Expr replacementExpr = args.get(2);

if (!ExprUtils.isStringLiteral(patternExpr)
|| NullHandling.nullToEmptyIfNeeded((String) patternExpr.getLiteralValue()) == null) {
throw validationFailed("pattern must be a string literal");
}

if (!replacementExpr.isLiteral()
|| NullHandling.nullToEmptyIfNeeded((String) replacementExpr.getLiteralValue()) == null) {
throw validationFailed("replacement must be a string literal");
}

this.arg = args.get(0);
this.pattern =
Pattern.compile(NullHandling.nullToEmptyIfNeeded((String) patternExpr.getLiteralValue()));
this.replacement = NullHandling.nullToEmptyIfNeeded((String) replacementExpr.getLiteralValue());
}

@Nonnull
@Override
public ExprEval<?> eval(final ObjectBinding bindings)
{
final String s = NullHandling.nullToEmptyIfNeeded(arg.eval(bindings).asString());

if (s == null) {
return ExprEval.of(null);
} else {
final Matcher matcher = pattern.matcher(s);
final String retVal = matcher.replaceAll(replacement);
return ExprEval.of(retVal);
}
}
}

/**
* Expr when pattern or replacement is dynamic (not a literal).
*/
class RegexpReplaceDynamicExpr extends BaseRegexpReplaceExpr
{
private RegexpReplaceDynamicExpr(List<Expr> args)
{
super(args);
}

@Nonnull
@Override
public ExprEval<?> eval(final ObjectBinding bindings)
{
final String s = NullHandling.nullToEmptyIfNeeded(args.get(0).eval(bindings).asString());
final String pattern = NullHandling.nullToEmptyIfNeeded(args.get(1).eval(bindings).asString());
final String replacement = NullHandling.nullToEmptyIfNeeded(args.get(2).eval(bindings).asString());

if (s == null || pattern == null || replacement == null) {
return ExprEval.of(null);
} else {
final Matcher matcher = Pattern.compile(pattern).matcher(s);
// Code scanning (CodeQL) check failure on the Pattern.compile call above: regular expression injection.
// The regular expression is constructed from a user-provided value.
final String retVal = matcher.replaceAll(replacement);
return ExprEval.of(retVal);
}
}
}
}
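
The split between the literal and dynamic classes above can be sketched outside Druid as follows. This is a simplified illustration under stated assumptions, not the macro itself: the names `ReplaceStrategySketch`, `literal`, and `dynamic` are hypothetical, and Druid's `NullHandling` conversions are reduced to plain null checks.

```java
import java.util.function.UnaryOperator;
import java.util.regex.Pattern;

// Hypothetical sketch for illustration only; not part of Druid.
class ReplaceStrategySketch
{
  // Literal case: pattern and replacement are known up front,
  // so the pattern is compiled once and reused for every input.
  static UnaryOperator<String> literal(String pattern, String replacement)
  {
    final Pattern compiled = Pattern.compile(pattern);
    return s -> s == null ? null : compiled.matcher(s).replaceAll(replacement);
  }

  // Dynamic case: pattern and replacement are only known per evaluation,
  // so the pattern has to be compiled on each call.
  static String dynamic(String s, String pattern, String replacement)
  {
    if (s == null || pattern == null || replacement == null) {
      return null;
    }
    return Pattern.compile(pattern).matcher(s).replaceAll(replacement);
  }

  public static void main(String[] args)
  {
    final UnaryOperator<String> replacer = literal("o+", "0");
    System.out.println(replacer.apply("foo boo")); // prints f0 b0
    System.out.println(dynamic("foo", "f.x", "beep")); // prints foo (no match)
    System.out.println(dynamic(null, "a", "b")); // prints null
  }
}
```

Compiling once in the literal case avoids repeating the regex compilation for every row, which is presumably why the macro checks `Expr::isLiteral` before choosing an implementation.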
@@ -0,0 +1,190 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
*/

package org.apache.druid.query.expression;

import org.apache.druid.common.config.NullHandling;
import org.apache.druid.math.expr.ExprEval;
import org.apache.druid.math.expr.ExpressionType;
import org.apache.druid.math.expr.InputBindings;
import org.junit.Assert;
import org.junit.Test;

public class RegexpReplaceExprMacroTest extends MacroTestBase
{
public RegexpReplaceExprMacroTest()
{
super(new RegexpReplaceExprMacro());
}

@Test
public void testErrorZeroArguments()
{
expectException(IllegalArgumentException.class, "Function[regexp_replace] requires 3 arguments");
eval("regexp_replace()", InputBindings.nilBindings());
}

@Test
public void testErrorFourArguments()
{
expectException(IllegalArgumentException.class, "Function[regexp_replace] requires 3 arguments");
eval("regexp_replace('a', 'b', 'c', 'd')", InputBindings.nilBindings());
}

@Test
public void testErrorNonStringPattern()
{
expectException(IllegalArgumentException.class, "Function[regexp_replace] pattern must be a string literal");
eval(
"regexp_replace(a, 1, 'x')",
InputBindings.forInputSupplier("a", ExpressionType.STRING, () -> "foo")
);
}

@Test
public void testErrorNullPattern()
{
if (NullHandling.sqlCompatible()) {
expectException(
IllegalArgumentException.class,
"Function[regexp_replace] pattern must be a string literal"
);
}

final ExprEval<?> result = eval(
"regexp_replace(a, null, 'x')",
InputBindings.forInputSupplier("a", ExpressionType.STRING, () -> "foo")
);

// SQL-compat should have thrown an error by now.
Assert.assertTrue(NullHandling.replaceWithDefault());
Assert.assertEquals("xfxoxox", result.value());
}

@Test
public void testNoMatch()
{
final ExprEval<?> result = eval(
"regexp_replace(a, 'f.x', 'beep')",
InputBindings.forInputSupplier("a", ExpressionType.STRING, () -> "foo")
);
Assert.assertEquals("foo", result.value());
}

@Test
public void testEmptyStringPattern()
{
final ExprEval<?> result = eval(
"regexp_replace(a, '', 'x')",
InputBindings.forInputSupplier("a", ExpressionType.STRING, () -> "foo")
);
Assert.assertEquals("xfxoxox", result.value());
}

@Test
public void testMultiLinePattern()
{
final ExprEval<?> result = eval(
"regexp_replace(a, '^foo\\\\nbar$', 'xxx')",
InputBindings.forInputSupplier("a", ExpressionType.STRING, () -> "foo\nbar")
);
Assert.assertEquals("xxx", result.value());
}

@Test
public void testMultiLinePatternNoMatch()
{
final ExprEval<?> result = eval(
"regexp_replace(a, '^foo\\\\nbar$', 'xxx')",
InputBindings.forInputSupplier("a", ExpressionType.STRING, () -> "foo\nbarz")
);
Assert.assertEquals("foo\nbarz", result.value());
}

@Test
public void testNullPatternOnEmptyString()
{
if (NullHandling.sqlCompatible()) {
expectException(IllegalArgumentException.class, "Function[regexp_replace] pattern must be a string literal");
}

final ExprEval<?> result = eval(
"regexp_replace(a, null, 'x')",
InputBindings.forInputSupplier("a", ExpressionType.STRING, () -> "")
);

// SQL-compat should have thrown an error by now.
Assert.assertTrue(NullHandling.replaceWithDefault());
Assert.assertEquals("x", result.value());
}

@Test
public void testEmptyStringPatternOnEmptyString()
{
final ExprEval<?> result = eval(
"regexp_replace(a, '', 'x')",
InputBindings.forInputSupplier("a", ExpressionType.STRING, () -> "")
);
Assert.assertEquals("x", result.value());
}

@Test
public void testNullPatternOnNull()
{
if (NullHandling.sqlCompatible()) {
expectException(
IllegalArgumentException.class,
"Function[regexp_replace] pattern must be a string literal"
);
}

final ExprEval<?> result = eval("regexp_replace(a, null, 'x')", InputBindings.nilBindings());

// SQL-compat should have thrown an error by now.
Assert.assertTrue(NullHandling.replaceWithDefault());
Assert.assertEquals("x", result.value());
}

@Test
public void testEmptyStringPatternOnNull()
{
final ExprEval<?> result = eval("regexp_replace(a, '', 'x')", InputBindings.nilBindings());

if (NullHandling.sqlCompatible()) {
Assert.assertNull(result.value());
} else {
Assert.assertEquals("x", result.value());
}
}

@Test
public void testUrlIdReplacement()
{
final ExprEval<?> result = eval(
"regexp_replace(regexp_replace(a, '\\\\?(.*)$', ''), '/(\\\\w+)(?=/|$)', '/*')",
InputBindings.forInputSupplier("a", ExpressionType.STRING, () -> "http://example.com/path/to?query")
);

// The input is non-null, so the result is the same in both null-handling modes.
Assert.assertEquals("http://example.com/*/*", result.value());
}
}
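
The edge cases these tests assert follow from `java.util.regex` semantics and can be reproduced without the Druid test harness. The sketch below uses a hypothetical class name (`RegexEdgeCaseSketch`) and writes the patterns as Java string literals rather than Druid expression strings, so they need one less level of escaping than in the tests.

```java
import java.util.regex.Pattern;

// Hypothetical sketch for illustration only; not part of Druid.
class RegexEdgeCaseSketch
{
  static String replace(String input, String pattern, String replacement)
  {
    return Pattern.compile(pattern).matcher(input).replaceAll(replacement);
  }

  public static void main(String[] args)
  {
    // An empty pattern matches at every position, including both ends.
    System.out.println(replace("foo", "", "x")); // prints xfxoxox
    System.out.println(replace("", "", "x")); // prints x

    // \n in the pattern matches the newline; ^ and $ anchor the whole input.
    System.out.println(replace("foo\nbar", "^foo\\nbar$", "xxx")); // prints xxx

    // Strip the query string, then replace each path segment with *.
    final String noQuery = replace("http://example.com/path/to?query", "\\?(.*)$", "");
    System.out.println(replace(noQuery, "/(\\w+)(?=/|$)", "/*")); // prints http://example.com/*/*
  }
}
```

The lookahead `(?=/|$)` is what keeps the host name intact: `/example` is followed by `.`, so it does not count as a path segment.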
@@ -42,7 +42,9 @@ private TestExprMacroTable(ObjectMapper jsonMapper)
new IPv4AddressParseExprMacro(),
new IPv4AddressStringifyExprMacro(),
new LikeExprMacro(),
new RegexpLikeExprMacro(),
new RegexpExtractExprMacro(),
new RegexpReplaceExprMacro(),
new TimestampCeilExprMacro(),
new TimestampExtractExprMacro(),
new TimestampFloorExprMacro(),