Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(expr): add to_jsonb #13161

Merged
merged 7 commits into from
Nov 3, 2023
Merged

feat(expr): add to_jsonb #13161

merged 7 commits into from
Nov 3, 2023

Conversation

KeXiangWang
Copy link
Contributor

@KeXiangWang KeXiangWang commented Oct 31, 2023

I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.

What's changed and what's your intention?

As title.
#12834

Checklist

  • I have written necessary rustdoc comments
  • I have added necessary unit tests and integration tests
  • I have added fuzzing tests or opened an issue to track them. (Optional, recommended for new SQL features Sqlsmith: Sql feature generation #7934).
  • My PR contains breaking changes. (If it deprecates some features, please create a tracking issue to remove them in the future).
  • All checks passed in ./risedev check (or alias, ./risedev c)
  • My PR changes performance-critical code. (Please run macro/micro-benchmarks and show the results.)
  • My PR contains critical fixes that are necessary to be merged into the latest release. (Please check out the details)

Documentation

  • My PR needs documentation updates. (Please use the Release note section below to summarize the impact on users)

Release note

to_jsonb (any) → jsonb
Extracts convert any type to jsonb. Arrays and composites are converted recursively to arrays and objects. For any scalar other than a number, a Boolean, or a null value, the text representation will be used, with escaping as necessary to make it a valid JSON string value.
to_jsonb(row(42, 'Fred said "Hi."'::text)) → {"f1": 42, "f2": "Fred said \"Hi.\""}

@KeXiangWang
Copy link
Contributor Author

Additional notes:

  1. Currently, a float like 1.00 will be printed as 1 in PG, but 1.0 in RW. For example:
select to_jsonb(1.0000::float);

# PG:
 to_jsonb
----------
 1

# RW:
 to_jsonb 
----------
 1.0

This requires further investigation.
2. Support to_jsonb with the correct time zone requires a more complicated implementation. I will open another PR to finish it.
3. Decimal is first tried to convert f64 to fit in JSON number; if it fails, convert to string. I also tried with PG, PG seems support a Jsonb number out of the range of IEEE 754 double:

SELECT to_jsonb(9.9999999999999999999999999999::decimal);
            to_jsonb
--------------------------------
 9.9999999999999999999999999999
(1 row)
  1. int256 is RW only, so I directly convert it to string.

@KeXiangWang KeXiangWang marked this pull request as draft October 31, 2023 04:25
@wangrunji0408
Copy link
Contributor

Currently, a float like 1.00 will be printed as 1 in PG, but 1.0 in RW.

This seems to require a custom formatter in the jsonbb library.

PG seems support a Jsonb number out of the range of IEEE 754 double.

This also requires arbitrary precision number support in the jsonbb library.

int256 is RW only, so I directly convert it to string.

This can also be stored as an arbitrary precision number. Converting to string will result in extra "".

I'll take on them.

@xiangjinwu
Copy link
Contributor

xiangjinwu commented Oct 31, 2023

  1. Currently, a float like 1.00 will be printed as 1 in PG, but 1.0 in RW. For example:

#6412 Formatting of float is more complicated than we thought. Even for cast(float) -> varchar we are just using the same ryu algorithm but not the same config parameters yet.

  1. Support to_jsonb with the correct time zone requires a more complicated implementation. I will open another PR to finish it.

#7175 (comment) Similar problem is not solved for cast(timestamptz[]) -> varchar as well. For now we can always output in UTC and it would be easier to fix after we migrate timezone handling to #12747

  1. Decimal is first tried to convert f64 to fit in JSON number; if it fails, convert to string. I also tried with PG, PG seems support a Jsonb number out of the range of IEEE 754 double:
  2. int256 is RW only, so I directly convert it to string.

I would suggest we just report an out-of-range error for such case, if doable. RFC 8259 acknowledges the fact that only IEEE 754 double has good interoperability. PostgreSQL jsonb uses its variable length decimal (aka numeric), which is much larger than our existing decimal. It does not have to be a blocker for the common use cases where numbers are within the range, before we support the larger range. Switching from an error to a number is not considered a breaking change, but switching from a string is.

@KeXiangWang KeXiangWang changed the title feat(expr): add to_json feat(expr): add to_jsonb Nov 1, 2023
@stdrc stdrc self-requested a review November 1, 2023 03:08
src/expr/impl/src/scalar/to_jsonb.rs Outdated Show resolved Hide resolved
let mut builder = jsonbb::Builder::default();
builder.begin_object();

let names: Vec<&str> = data_type.as_struct().names().collect();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can also zip the names to avoid collection to vec.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sometimes the names vec is empty, for instance:

select to_jsonb(ARRAY[3,4,5,6]);

Any suggestion for this situation regarding to zip?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that for unnamed fields, pg always names them as f1, f2, .... Then I think we should code this behavior in StructType and make names() always return non-empty.

postgres=# select (row(1, 2)).*;
 f1 | f2 
----+----
  1 |  2

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems code is also using names().

Let me double check and fix it in the jsonb_agg enhancement PR.

src/expr/impl/src/scalar/to_jsonb.rs Outdated Show resolved Hide resolved
src/expr/impl/src/scalar/to_jsonb.rs Outdated Show resolved Hide resolved
@KeXiangWang KeXiangWang marked this pull request as ready for review November 2, 2023 05:58
@KeXiangWang
Copy link
Contributor Author

Still have some issues:

  1. For Decimal, for now we convert it to f64 (to str if it's inf/-inf/nan). It may lose some precision. Currently we cannot find a good way to convert to f64 without precision lose. Because f64 conversion considers precision lose as acceptable. For example, f64::from(9.999999999900999) == 10.0 is expected. Please let me know if you have any ideas.
  2. Int256 face the similar problem. But I think currently we can try convert it to i64, return an error if it fails, as Xiangjin suggested.
  3. Now one small problem is that the message NOTICE: Your session timezone is UTC. is shown for all types of array and struct. For example, even to_jsonb(array[1, 2,3]) will produce this message. This should be avoided.

I'll take a look tomorrow to fix 2 and 3.

Copy link

codecov bot commented Nov 2, 2023

Codecov Report

Merging #13161 (c2e6447) into main (57f75da) will decrease coverage by 0.04%.
The diff coverage is 1.44%.

@@            Coverage Diff             @@
##             main   #13161      +/-   ##
==========================================
- Coverage   68.06%   68.03%   -0.04%     
==========================================
  Files        1515     1516       +1     
  Lines      257111   257257     +146     
==========================================
+ Hits       175013   175028      +15     
- Misses      82098    82229     +131     
Flag Coverage Δ
rust 68.03% <1.44%> (-0.04%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Coverage Δ
src/frontend/src/binder/expr/function.rs 78.30% <100.00%> (+0.01%) ⬆️
src/frontend/src/expr/pure.rs 87.69% <ø> (ø)
src/common/src/types/jsonb.rs 37.75% <50.00%> (+4.05%) ⬆️
src/expr/impl/src/aggregate/jsonb_agg.rs 0.00% <0.00%> (ø)
src/expr/impl/src/scalar/to_jsonb.rs 0.00% <0.00%> (ø)

... and 3 files with indirect coverage changes

📣 Codecov offers a browser extension for seamless coverage viewing on GitHub. Try it in Chrome or Firefox today!

Copy link
Contributor

@wangrunji0408 wangrunji0408 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks

src/common/src/types/jsonb.rs Outdated Show resolved Hide resolved
src/common/src/types/jsonb.rs Outdated Show resolved Hide resolved
let mut builder = jsonbb::Builder::default();
builder.begin_object();

let names: Vec<&str> = data_type.as_struct().names().collect();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that for unnamed fields, pg always names them as f1, f2, .... Then I think we should code this behavior in StructType and make names() always return non-empty.

postgres=# select (row(1, 2)).*;
 f1 | f2 
----+----
  1 |  2

Copy link
Contributor

@stdrc stdrc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@xiangjinwu
Copy link
Contributor

To summarize:

  • bool
    • converts to jsonb bool trivially
  • int16 / int32
    • converts to jsonb number trivially
  • float32 / float64
    • converts to jsonb number except for nan / inf / -inf
  • int64 / decimal / int256
    • may not fit into IEEE 754 double, but PostgreSQL uses number (backed by decimal); we can report out-of-range error
  • varchar
    • converts to jsonb string trivially
  • interval / date / time / bytea
    • converts to jsonb string using ToText
    • bytea does not leverage From<&[u8]> for Value due to conflicting behavior, or the orphan rule
  • timestamp
    • converts to jsonb string using a new format, with T as separator
  • timestamptz
    • converts to jsonb string using a new format and requires implicit session timezone
  • jsonb
    • noop From<T> for T
  • list
    • recursive with special care given to bytea and timestamptz
  • struct
    • recursive with special care given to bytea and timestamptz
  • serial
    • currently jsonb string using ToText, but we provide no semantic guarantee on it and may change it freely in the future

Some suggestions:

  • Given the implicit session timezone handling is going to be refactored soon, we may choose not to support it here the old way. Practically it is acceptable to only support 2006-01-02T22:04:05Z and the uses of 2006-01-02T15:04:05-07:00 in json are rare.
  • Rather than From<T> for Value we can also do ToJsonb for T. By using a dedicated new trait we avoid the bytea conflict problem. In context other than the to_jsonb series of functions, there may other preferred way to convert a date or an int64 to json value. For example, a date can be encoded as number of days since 1970-01-01 (in debezium sink), or an int64 can be encoded as a string for better interoperability (as defined by proto3). From<T> for Value can be too general. Furthermore, we already have a trait definition here.
  • As shown above, we do not want to provide any guarantee on serial. Today it actually contains an integer - but I am not saying we should convert it to a jsonb number. A string is more future-proof. And according to PostgreSQL it should just use text representation as everything else.
test=# select xmin, pg_typeof(xmin), to_jsonb(xmin) from test;
 xmin | pg_typeof | to_jsonb 
------+-----------+----------
  746 | xid       | "746"
(1 row)

@wangrunji0408
Copy link
Contributor

bytea does not leverage From<&[u8]> for Value due to conflicting behavior, or the orphan rule

Maybe it's time to introduce the Bytea & ByteaRef type? Functions like str_to_bytea can be collected into this new type.

Furthermore, we already have a trait definition here.

Right. We should unify these traits. The other remaining task is to support any type as the argument of jsonb_agg.

@KeXiangWang
Copy link
Contributor Author

KeXiangWang commented Nov 2, 2023

introduce the Bytea & ByteaRef type

Agree. Let's do it in another PR.

We should unify these traits.

Yes. I'm now trying to take use of the ToJson trait to refactor these codes.

@KeXiangWang
Copy link
Contributor Author

KeXiangWang commented Nov 2, 2023

Update:

  1. int64 / decimal / int256 will be converted and expressed in F64 (except for nan / inf / -inf)
  2. Removed the implementation of timestamptz in time zone, which will be handled be future refactoring soon.
  3. Created one issue to track the float formating problem mentioned above.

The other remaining task is to support any type as the argument of jsonb_agg.

  1. Will make a new PR to do that.

Copy link
Contributor

@xiangjinwu xiangjinwu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly lgtm

Comment on lines +173 to +177
impl ToJsonb for i64 {
fn add_to(self, builder: &mut Builder) -> Result<()> {
let res: F64 = self
.try_into()
.map_err(|_| ExprError::CastOutOfRange("IEEE 754 double"))?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just sharing more on this, given the e2e test failure in CI:

  • RFC 8259 allows this: type Number = f64.
  • PostgreSQL does this: type Number = Decimal.
    • PostgreSQL decimal can even hold f64::MAX
  • serde_json/simd-json/jsonbb does this: enum Number {I64, U64, F64}
    • serde_json has a feature arbitrary_precision to type Number = String
    • simd-json has a feature 128bit to enum Number {I64, U64, F64, I128, U128}
    • jsonbb may be extended to enum Number {I64, U64, F64, Decimal}
      • This requires a variable length decimal. rust_decimal is not enough.

Back to to_jsonb: we do have the capability to avoid precision loss on i64 today, but using the lossy f64 feels more consistent with decimal / int256 which we cannot encode losslessly yet. Both sounds acceptable to me. And the .0 issue would cause less problems after the jsonbb serializer fix.

e2e_test/batch/basic/to_jsonb.slt.part Outdated Show resolved Hide resolved
e2e_test/batch/basic/to_jsonb.slt.part Outdated Show resolved Hide resolved
e2e_test/batch/basic/to_jsonb.slt.part Outdated Show resolved Hide resolved
e2e_test/batch/basic/to_jsonb.slt.part Outdated Show resolved Hide resolved
e2e_test/batch/basic/to_jsonb.slt.part Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
@fuyufjh
Copy link
Member

fuyufjh commented Nov 3, 2023

int64 / decimal / int256
may not fit into IEEE 754 double, but PostgreSQL uses number (backed by decimal); we can report out-of-range error

Tips: We don't have anyway to report out-of-range error during streaming. Instead, the whole JSON field will be filled with null and a warning log will be printed.

I would recommend to just (implicitly) convert int64 / decimal to float64, regardless of the precision loss, which looks intuitive to me. For int256, string might be better because it's majorly designed for blockchain address, which usually use all the 256-bits.

@xiangjinwu
Copy link
Contributor

int64 / decimal / int256
may not fit into IEEE 754 double, but PostgreSQL uses number (backed by decimal); we can report out-of-range error

Tips: We don't have anyway to report out-of-range error during streaming. Instead, the whole JSON field will be filled with null.

I would recommend to just (implicitly) convert int64 / decimal to float64, regardless of the precision loss, which looks intuitive to me. For int256, string might be better because it's majorly designed for blockchain address, which usually use all the 256-bits.

  • There are two types of errors here: precision loss and out-of-range.
    • There is no From<i64> for f64 in rust because it can cause precision loss. But i64::MAX < f64::MAX and we did allow the loss following PostgreSQL and added From<i64> for OrderedFloat<f64>.
    • Out-of-range is for values greater than f64::MAX. I should not have mentioned it in this issue, as it is super large and covers all of our int64 / decimal / int256. It was me that confused these two errors at the beginning. f64 out of range is only possible in PostgreSQL decimal 1e309. That is, practically, the out-of-range error in this PR is unreachable.
  • Good point on int256. So let's treat it as "everything else uses text representation" rather than in the number category.

@KeXiangWang KeXiangWang added this pull request to the merge queue Nov 3, 2023
Merged via the queue into main with commit 624094c Nov 3, 2023
29 of 30 checks passed
@KeXiangWang KeXiangWang deleted the wkx/to_jsonb branch November 3, 2023 06:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants