Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for Base64 Encode/Decode Scalar Functions #9114

Merged
merged 14 commits into from
Aug 2, 2022

Conversation

SabrinaZhaozyf
Copy link
Contributor

@SabrinaZhaozyf SabrinaZhaozyf commented Jul 27, 2022

Label = feature

This PR implements toBase64(byte[]) and fromBase64(str) encoding and decoding functionality.

  • Encoding scheme: encodes input using the Base64 encoding scheme as specified in RFC 4648 and RFC 2045
  • Decoding will throw an exception if invalid base64 string is passed in.

Use Case Examples

  • Encode binary data: toBase64(binaryCol)
  • Encode strings in UTF8: toBase64(toUtf8('hello')) , toBase64(toUtf8(stringCol))
  • Decode base64-encoded strings to binary data: fromBase64('aGVsbG8h'), fromBase64(stringCol)
  • Decode base64-encoded strings to UTF8 charset: fromUtf8(fromBase64('aGVsbG8h'))

@SabrinaZhaozyf SabrinaZhaozyf changed the title Added Base64 Encoding/Decoding Scalar Functions [WIP] Added Base64 Encoding/Decoding Scalar Functions Jul 27, 2022
@amrishlal
Copy link
Contributor

In the description, can we add a short overview of the issue being addressed along with a summary of changes.

@siddharthteotia siddharthteotia marked this pull request as ready for review July 27, 2022 21:15
@SabrinaZhaozyf SabrinaZhaozyf changed the title [WIP] Added Base64 Encoding/Decoding Scalar Functions Added Base64 Encoding/Decoding Scalar Functions Jul 27, 2022
@codecov-commenter
Copy link

codecov-commenter commented Jul 28, 2022

Codecov Report

Merging #9114 (1d0037e) into master (2676ffa) will decrease coverage by 3.00%.
The diff coverage is 38.76%.

❗ Current head 1d0037e differs from pull request most recent head ef0f3d2. Consider uploading reports for the commit ef0f3d2 to get more accurate results

@@             Coverage Diff              @@
##             master    #9114      +/-   ##
============================================
- Coverage     69.96%   66.96%   -3.01%     
- Complexity     4749     4800      +51     
============================================
  Files          1838     1378     -460     
  Lines         97350    70889   -26461     
  Branches      14672    11211    -3461     
============================================
- Hits          68113    47469   -20644     
+ Misses        24495    20018    -4477     
+ Partials       4742     3402    -1340     
Flag Coverage Δ
integration1 ?
integration2 ?
unittests1 66.96% <38.76%> (+0.06%) ⬆️
unittests2 ?

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...ava/org/apache/pinot/common/config/GrpcConfig.java 0.00% <0.00%> (ø)
...ache/pinot/common/metadata/ZKMetadataProvider.java 15.16% <0.00%> (-52.07%) ⬇️
...mmon/metadata/controllerjob/ControllerJobType.java 0.00% <0.00%> (ø)
...n/java/org/apache/pinot/common/utils/TlsUtils.java 3.57% <0.00%> (-72.78%) ⬇️
...pache/pinot/common/utils/grpc/GrpcQueryClient.java 0.00% <0.00%> (-75.48%) ⬇️
.../pinot/core/data/manager/BaseTableDataManager.java 73.22% <ø> (-0.34%) ⬇️
...ata/manager/realtime/RealtimeTableDataManager.java 11.32% <0.00%> (-55.48%) ⬇️
...re/operator/StreamingInstanceResponseOperator.java 0.00% <0.00%> (-76.93%) ⬇️
...t/core/plan/StreamingInstanceResponsePlanNode.java 0.00% <0.00%> (-100.00%) ⬇️
...che/pinot/core/transport/grpc/GrpcQueryServer.java 0.00% <0.00%> (-51.32%) ⬇️
... and 742 more

Help us with your feedback. Take ten seconds to tell us how you rate us.

*/
@ScalarFunction
public static String binaryToBase64(byte[] input) {
return Base64.getEncoder().encodeToString(input);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the expectation on charset of input bytes / binary data ? Looks like we assume it is UTF8. What if the input bytes provided here was encoded using UTF16

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolved - we will not have a fixed charset of input bytes now with the new implementation. For example, if user wants to encode the String "hello!", he/she can choose to do toBase64(toUtf8("hello")) or toBase64(toUtf16("hello")) (Pinot currently only supports toUtf8 and toASCII functions). During decoding, he/she can do fromUtf8(fromBase64("fvduivheui")) to fromUtf16(fromBase64("fvduivheui")) to get "hello!" depending on if they used UTF8 or UTF16 when they encoded.

Copy link
Contributor

@jasperjiaguo jasperjiaguo Aug 2, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we error if the user passes a string to the toBase64 function, or some implicit type casting will happen? MySQL users might not expect the syntax of toBase64(toUtf8("hello")). Similarly, if the user calls on fromBase64('aGVsbG8h'), they might not expecting us to return hex string. We should at least call this out in the documentation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolved - added tests for calling toBase64 on a string. Will add in documentation that BYTES col will be represented in HEX string(also link:https://docs.pinot.apache.org/users/user-guide-query/querying-pinot#bytes-column).

* @return decoded string
*/
@ScalarFunction
public static String fromBase64(String input) {
Copy link
Contributor

@siddharthteotia siddharthteotia Jul 29, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we are providing a toBase64 with a user specified charset encoding and a default utf8 charset flavor then shouldn't we also provide fromBase64 in both flavors as well ?

What if someone calls toBase64("blahblah", "UTF16") to get "something" and then fromBase64("something") - this will convert "something" to bytes and then to String using UTF8 which can be wrong because "something" was created from "blahblah" using UTF16.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolved - fromBase64 will decode into byte[] (the layer from byte[] to String is removed now) and the user can decide to encode the byte[] to UTF8 strings or other charsets by calling other functions such as fromUtf8.

@siddharthteotia siddharthteotia changed the title Added Base64 Encoding/Decoding Scalar Functions Add support for Base64 Encode/Decode Scalar Functions Jul 29, 2022
Copy link
Contributor

@siddharthteotia siddharthteotia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the contribution

@siddharthteotia
Copy link
Contributor

@SabrinaZhaozyf - please update the PR description to reflect the latest discussion / approach

Copy link
Contributor

@jasperjiaguo jasperjiaguo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor questions regarding invalid input / unexpected usage.

*/
@ScalarFunction
public static String binaryToBase64(byte[] input) {
return Base64.getEncoder().encodeToString(input);
Copy link
Contributor

@jasperjiaguo jasperjiaguo Aug 2, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we error if the user passes a string to the toBase64 function, or some implicit type casting will happen? MySQL users might not expect the syntax of toBase64(toUtf8("hello")). Similarly, if the user calls on fromBase64('aGVsbG8h'), they might not expecting us to return hex string. We should at least call this out in the documentation.

@siddharthteotia siddharthteotia merged commit 913f0d8 into apache:master Aug 2, 2022
@siddharthteotia
Copy link
Contributor

@SabrinaZhaozyf - let's user docs to gitbook please

@npawar
Copy link
Contributor

npawar commented Aug 27, 2022

@SabrinaZhaozyf - let's user docs to gitbook please

Hello, was this added into the docs? I cannot find it in https://docs.pinot.apache.org/configuration-reference/functions

@SabrinaZhaozyf
Copy link
Contributor Author

@npawar Changes to user doc has been merged here (pinot-contrib/pinot-docs#109). Not sure why it hasn't been updated on documentation yet.

@npawar
Copy link
Contributor

npawar commented Aug 29, 2022

@npawar Changes to user doc has been merged here (pinot-contrib/pinot-docs#109). Not sure why it hasn't been updated on documentation yet.

Oh i see. lemme check on that

@siddharthteotia
Copy link
Contributor

@npawar - User docs were added some time ago by @SabrinaZhaozyf

Just linking them here for reference - https://docs.pinot.apache.org/configuration-reference/functions/base64

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants