feat(expr): add to_jsonb #13161

KeXiangWang · 2023-10-31T01:43:56Z

I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.

What's changed and what's your intention?

As title.
#12834

Checklist

I have written necessary rustdoc comments
I have added necessary unit tests and integration tests
I have added fuzzing tests or opened an issue to track them. (Optional, recommended for new SQL features Sqlsmith: Sql feature generation #7934).
My PR contains breaking changes. (If it deprecates some features, please create a tracking issue to remove them in the future).
All checks passed in ./risedev check (or alias, ./risedev c)
My PR changes performance-critical code. (Please run macro/micro-benchmarks and show the results.)

My PR contains critical fixes that are necessary to be merged into the latest release. (Please check out the details)

Documentation

My PR needs documentation updates. (Please use the Release note section below to summarize the impact on users)

Release note

to_jsonb (any) → jsonb
Converts any value to jsonb. Arrays and composites are converted recursively to arrays and objects. For any scalar other than a number, a Boolean, or a null value, the text representation is used, with escaping as necessary to make it a valid JSON string value.
to_jsonb(row(42, 'Fred said "Hi."'::text)) → {"f1": 42, "f2": "Fred said \"Hi.\""}

KeXiangWang · 2023-10-31T01:51:36Z

Additional notes:

1. Currently, a float like 1.00 is printed as 1 in PG but as 1.0 in RW. For example:

select to_jsonb(1.0000::float);

# PG:
 to_jsonb
----------
 1

# RW:
 to_jsonb 
----------
 1.0

This requires further investigation.
2. Supporting to_jsonb with the correct time zone requires a more complicated implementation. I will open another PR to finish it.
3. Decimal is first converted to f64 so that it fits in a JSON number; if that fails, it is converted to a string. I also tried PG, which seems to support a jsonb number outside the range of IEEE 754 double:

SELECT to_jsonb(9.9999999999999999999999999999::decimal);
            to_jsonb
--------------------------------
 9.9999999999999999999999999999
(1 row)

int256 is RW only, so I directly convert it to string.
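The number-or-string fallback described in item 3 can be sketched roughly as follows. This is only an illustration, not the PR's code: `JsonValue` and `float_to_json` are hypothetical names, the input is assumed to have already been converted to f64, and the text form used for non-finite values is Rust's own (`inf`, `-inf`, `NaN`), which may differ from what PG or RW print.

```rust
/// Hypothetical stand-in for a JSON value; this is not the jsonbb API.
#[derive(Debug, PartialEq)]
enum JsonValue {
    Number(f64),
    String(String),
}

/// Emit a JSON number when the value is finite, otherwise fall back to text.
fn float_to_json(v: f64) -> JsonValue {
    if v.is_finite() {
        JsonValue::Number(v)
    } else {
        // JSON (RFC 8259) has no literal for NaN or the infinities,
        // so the only way to keep the value is as a string.
        JsonValue::String(v.to_string())
    }
}
```

The string branch is what produces the extra quotes in the output, so a caller can tell the two cases apart only by the JSON type.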

wangrunji0408 · 2023-10-31T04:54:00Z

Currently, a float like 1.00 will be printed as 1 in PG, but 1.0 in RW.

This seems to require a custom formatter in the jsonbb library.

PG seems support a Jsonb number out of the range of IEEE 754 double.

This also requires arbitrary precision number support in the jsonbb library.

int256 is RW only, so I directly convert it to string.

This can also be stored as an arbitrary precision number. Converting to string will result in extra "".

I'll take these on.

xiangjinwu · 2023-10-31T08:48:59Z

Currently, a float like 1.00 will be printed as 1 in PG, but 1.0 in RW. For example:

#6412 Formatting of float is more complicated than we thought. Even for cast(float) -> varchar we are just using the same ryu algorithm but not the same config parameters yet.

Support to_jsonb with the correct time zone requires a more complicated implementation. I will open another PR to finish it.

#7175 (comment) A similar problem is still unsolved for cast(timestamptz[]) -> varchar. For now we can always output in UTC, and it will be easier to fix after we migrate timezone handling to #12747

Decimal is first tried to convert f64 to fit in JSON number; if it fails, convert to string. I also tried with PG, PG seems support a Jsonb number out of the range of IEEE 754 double:

int256 is RW only, so I directly convert it to string.

I would suggest we just report an out-of-range error for such cases, if doable. RFC 8259 acknowledges that only IEEE 754 double has good interoperability. PostgreSQL jsonb uses its variable-length decimal (aka numeric), which has a much larger range than our existing decimal. Until we support the larger range, this does not have to be a blocker for the common use cases where numbers are within range. Switching from an error to a number is not considered a breaking change, but switching from a string is.

src/expr/impl/src/scalar/to_jsonb.rs

wangrunji0408 · 2023-11-01T06:25:28Z

src/expr/impl/src/scalar/to_jsonb.rs

+    let mut builder = jsonbb::Builder::default();
+    builder.begin_object();
+
+    let names: Vec<&str> = data_type.as_struct().names().collect();


We can also zip the names to avoid collecting them into a Vec.

Sometimes the names vec is empty, for instance:

select to_jsonb(ARRAY[3,4,5,6]);

Any suggestions on how to handle this situation with zip?

It seems that for unnamed fields, pg always names them as f1, f2, .... Then I think we should code this behavior in StructType and make names() always return non-empty.

postgres=# select (row(1, 2)).*;
 f1 | f2
----+----
  1 |  2

Seems code is also using names().

Let me double check and fix it in the jsonb_agg enhancement PR.
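A minimal sketch of the zip idea discussed in this thread, assuming the field names may be missing and should fall back to PostgreSQL-style positional names (f1, f2, ...). The function `zip_named` and its slice-based signature are hypothetical and only for illustration; they are not the StructType API.

```rust
/// Pair each field value with a name, falling back to positional names
/// (f1, f2, ...) for any field that has no explicit name.
fn zip_named<'a, T>(names: &[&str], values: &'a [T]) -> Vec<(String, &'a T)> {
    // Resolve a name for every position lazily, without building a Vec of names.
    let resolved = (0..values.len()).map(|i| match names.get(i) {
        Some(name) => name.to_string(),
        None => format!("f{}", i + 1),
    });
    // Zip the resolved names with the values, as suggested in the review.
    resolved.zip(values.iter()).collect()
}
```

With the fallback inside the name iterator, an empty name list no longer makes the zip yield nothing, which is the situation raised above.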

src/expr/impl/src/scalar/to_jsonb.rs

KeXiangWang · 2023-11-02T06:11:24Z

Still have some issues:

1. For Decimal, we currently convert it to f64 (or to a string if it is inf/-inf/nan). This may lose some precision. We have not found a good way to convert to f64 without precision loss, because f64 conversion treats precision loss as acceptable. For example, f64::from(9.999999999900999) == 10.0 is expected. Please let me know if you have any ideas.
2. Int256 faces a similar problem. But I think for now we can try to convert it to i64 and return an error if that fails, as Xiangjin suggested.
3. One small problem is that the message NOTICE: Your session timezone is UTC. is shown for all array and struct types. For example, even to_jsonb(array[1, 2,3]) produces this message. This should be avoided.

I'll take a look tomorrow to fix 2 and 3.

codecov · 2023-11-02T07:27:45Z

Codecov Report

Merging #13161 (c2e6447) into main (57f75da) will decrease coverage by 0.04%.
The diff coverage is 1.44%.

@@            Coverage Diff             @@
##             main   #13161      +/-   ##
==========================================
- Coverage   68.06%   68.03%   -0.04%     
==========================================
  Files        1515     1516       +1     
  Lines      257111   257257     +146     
==========================================
+ Hits       175013   175028      +15     
- Misses      82098    82229     +131

Flag	Coverage Δ
rust	`68.03% <1.44%> (-0.04%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files	Coverage Δ
src/frontend/src/binder/expr/function.rs	`78.30% <100.00%> (+0.01%)`	⬆️
src/frontend/src/expr/pure.rs	`87.69% <ø> (ø)`
src/common/src/types/jsonb.rs	`37.75% <50.00%> (+4.05%)`	⬆️
src/expr/impl/src/aggregate/jsonb_agg.rs	`0.00% <0.00%> (ø)`
src/expr/impl/src/scalar/to_jsonb.rs	`0.00% <0.00%> (ø)`

... and 3 files with indirect coverage changes

📣 Codecov offers a browser extension for seamless coverage viewing on GitHub. Try it in Chrome or Firefox today!

wangrunji0408

LGTM! Thanks

src/common/src/types/jsonb.rs

wangrunji0408 · 2023-11-02T10:24:50Z

src/expr/impl/src/scalar/to_jsonb.rs

+    let mut builder = jsonbb::Builder::default();
+    builder.begin_object();
+
+    let names: Vec<&str> = data_type.as_struct().names().collect();


It seems that for unnamed fields, pg always names them as f1, f2, .... Then I think we should code this behavior in StructType and make names() always return non-empty.

postgres=# select (row(1, 2)).*;
 f1 | f2
----+----
  1 |  2

stdrc

LGTM

xiangjinwu · 2023-11-02T15:55:04Z

To summarize:

bool
- converts to jsonb bool trivially
int16 / int32
- converts to jsonb number trivially
float32 / float64
- converts to jsonb number except for nan / inf / -inf
int64 / decimal / int256
- may not fit into IEEE 754 double, but PostgreSQL uses number (backed by decimal); we can report out-of-range error
varchar
- converts to jsonb string trivially
interval / date / time / bytea
- converts to jsonb string using ToText
- bytea does not leverage From<&[u8]> for Value due to conflicting behavior, or the orphan rule
timestamp
- converts to jsonb string using a new format, with T as separator
timestamptz
- converts to jsonb string using a new format and requires implicit session timezone
jsonb
- noop From<T> for T
list
- recursive with special care given to bytea and timestamptz
struct
- recursive with special care given to bytea and timestamptz
serial
- currently jsonb string using ToText, but we provide no semantic guarantee on it and may change it freely in the future

Some suggestions:

Given the implicit session timezone handling is going to be refactored soon, we may choose not to support it here the old way. Practically it is acceptable to only support 2006-01-02T22:04:05Z and the uses of 2006-01-02T15:04:05-07:00 in json are rare.
Rather than From<T> for Value we can also do ToJsonb for T. By using a dedicated new trait we avoid the bytea conflict problem. In contexts other than the to_jsonb series of functions, there may be other preferred ways to convert a date or an int64 to a JSON value. For example, a date can be encoded as the number of days since 1970-01-01 (in the debezium sink), or an int64 can be encoded as a string for better interoperability (as defined by proto3). From<T> for Value can be too general. Furthermore, we already have a trait definition here.
As shown above, we do not want to provide any guarantee on serial. Today it actually contains an integer - but I am not saying we should convert it to a jsonb number. A string is more future-proof. And according to PostgreSQL it should just use text representation as everything else.

test=# select xmin, pg_typeof(xmin), to_jsonb(xmin) from test;
 xmin | pg_typeof | to_jsonb 
------+-----------+----------
  746 | xid       | "746"
(1 row)
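The dedicated-trait suggestion above can be sketched as follows. This is a toy under stated assumptions: `TextBuilder` is a hypothetical stand-in that appends JSON text and is not `jsonbb::Builder`, the error type is a plain `String`, and the bytea encoding shown (`\x` followed by hex) is chosen only to illustrate that a crate-local trait is free to define its own behavior for `&[u8]`.

```rust
/// Toy builder that appends JSON text; a stand-in, not jsonbb::Builder.
#[derive(Default)]
struct TextBuilder {
    out: String,
}

/// Hypothetical dedicated conversion trait, in the spirit of the suggestion.
trait ToJsonb {
    fn add_to(self, builder: &mut TextBuilder) -> Result<(), String>;
}

impl ToJsonb for bool {
    fn add_to(self, b: &mut TextBuilder) -> Result<(), String> {
        b.out.push_str(if self { "true" } else { "false" });
        Ok(())
    }
}

impl ToJsonb for i32 {
    fn add_to(self, b: &mut TextBuilder) -> Result<(), String> {
        b.out.push_str(&self.to_string());
        Ok(())
    }
}

// Because the trait is local, it can be implemented for &[u8] without
// clashing with any From<&[u8]> impl that another crate provides.
impl ToJsonb for &[u8] {
    fn add_to(self, b: &mut TextBuilder) -> Result<(), String> {
        // Text form \x<hex>; the backslash is escaped inside the JSON string.
        b.out.push_str("\"\\\\x");
        for byte in self {
            b.out.push_str(&format!("{:02x}", byte));
        }
        b.out.push('"');
        Ok(())
    }
}

// Lists recurse into their elements.
impl<T: ToJsonb> ToJsonb for Vec<T> {
    fn add_to(self, b: &mut TextBuilder) -> Result<(), String> {
        b.out.push('[');
        for (i, item) in self.into_iter().enumerate() {
            if i > 0 {
                b.out.push_str(", ");
            }
            item.add_to(b)?;
        }
        b.out.push(']');
        Ok(())
    }
}

/// Convenience wrapper that runs a conversion and returns the JSON text.
fn to_jsonb_text<T: ToJsonb>(v: T) -> Result<String, String> {
    let mut b = TextBuilder::default();
    v.add_to(&mut b)?;
    Ok(b.out)
}
```

Other contexts, such as a sink that encodes dates as day counts, could define their own trait in the same way without affecting this one.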

wangrunji0408 · 2023-11-02T17:04:47Z

bytea does not leverage From<&[u8]> for Value due to conflicting behavior, or the orphan rule

Maybe it's time to introduce the Bytea & ByteaRef type? Functions like str_to_bytea can be collected into this new type.

Furthermore, we already have a trait definition here.

Right. We should unify these traits. The other remaining task is to support any type as the argument of jsonb_agg.

KeXiangWang · 2023-11-02T17:19:12Z

introduce the Bytea & ByteaRef type

Agree. Let's do it in another PR.

We should unify these traits.

Yes. I'm now trying to make use of the ToJson trait to refactor this code.

KeXiangWang · 2023-11-02T23:20:57Z

Update:

int64 / decimal / int256 will be converted and expressed in F64 (except for nan / inf / -inf)
Removed the implementation of timestamptz in time zone, which will be handled by the upcoming refactoring.
Created one issue to track the float formatting problem mentioned above.

The other remaining task is to support any type as the argument of jsonb_agg.

Will make a new PR to do that.

xiangjinwu

Mostly lgtm

xiangjinwu · 2023-11-03T03:23:57Z

src/expr/impl/src/scalar/to_jsonb.rs

+impl ToJsonb for i64 {
+    fn add_to(self, builder: &mut Builder) -> Result<()> {
+        let res: F64 = self
+            .try_into()
+            .map_err(|_| ExprError::CastOutOfRange("IEEE 754 double"))?;


Just sharing more on this, given the e2e test failure in CI:

RFC 8259 allows this: type Number = f64.

PostgreSQL does this: type Number = Decimal.

PostgreSQL decimal can even hold f64::MAX

serde_json/simd-json/jsonbb does this: enum Number {I64, U64, F64}

serde_json has a feature arbitrary_precision to type Number = String

simd-json has a feature 128bit to enum Number {I64, U64, F64, I128, U128}

jsonbb may be extended to enum Number {I64, U64, F64, Decimal}

This requires a variable length decimal. rust_decimal is not enough.

Back to to_jsonb: we do have the capability to avoid precision loss on i64 today, but using the lossy f64 feels more consistent with decimal / int256, which we cannot encode losslessly yet. Both sound acceptable to me. And the .0 issue would cause fewer problems after the jsonbb serializer fix.

e2e_test/batch/basic/to_jsonb.slt.part

README.md

fuyufjh · 2023-11-03T03:51:17Z

int64 / decimal / int256
may not fit into IEEE 754 double, but PostgreSQL uses number (backed by decimal); we can report out-of-range error

Tips: We don't have any way to report an out-of-range error during streaming. Instead, the whole JSON field will be filled with null and a warning log will be printed.

I would recommend just (implicitly) converting int64 / decimal to float64, regardless of the precision loss, which looks intuitive to me. For int256, string might be better because it is mainly designed for blockchain addresses, which usually use all 256 bits.

xiangjinwu · 2023-11-03T04:10:49Z

int64 / decimal / int256
may not fit into IEEE 754 double, but PostgreSQL uses number (backed by decimal); we can report out-of-range error

Tips: We don't have anyway to report out-of-range error during streaming. Instead, the whole JSON field will be filled with null.

I would recommend to just (implicitly) convert int64 / decimal to float64, regardless of the precision loss, which looks intuitive to me. For int256, string might be better because it's majorly designed for blockchain address, which usually use all the 256-bits.

There are two types of errors here: precision loss and out-of-range.
- There is no From<i64> for f64 in rust because it can cause precision loss. But i64::MAX < f64::MAX and we did allow the loss following PostgreSQL and added From<i64> for OrderedFloat<f64>.
- Out-of-range is for values greater than f64::MAX. I should not have mentioned it in this issue, as it is super large and covers all of our int64 / decimal / int256. It was me that confused these two errors at the beginning. f64 out of range is only possible in PostgreSQL decimal 1e309. That is, practically, the out-of-range error in this PR is unreachable.
Good point on int256. So let's treat it as "everything else uses text representation" rather than in the number category.
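The difference between the two error kinds above can be checked with a small standard-library sketch. The helper names `is_exact_in_f64` and `overflows_f64` are hypothetical and exist only for this illustration.

```rust
/// Returns true when an i64 survives the conversion to f64 unchanged.
fn is_exact_in_f64(v: i64) -> bool {
    let f = v as f64;
    // Compare in i128 so that a value rounded up to 2^63 is still representable.
    (f as i128) == (v as i128)
}

/// Parses decimal text as f64 and reports whether it overflowed to infinity.
fn overflows_f64(text: &str) -> bool {
    matches!(text.parse::<f64>(), Ok(f) if f.is_infinite())
}
```

Here `is_exact_in_f64(i64::MAX)` is false (precision loss) while `(i64::MAX as f64)` is still finite, and only an input such as `"1e309"` makes `overflows_f64` return true.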

KeXiangWang requested a review from a team as a code owner October 31, 2023 01:43

github-actions bot added the type/feature label Oct 31, 2023

KeXiangWang requested review from st1page, wangrunji0408 and xiangjinwu October 31, 2023 01:44

KeXiangWang marked this pull request as draft October 31, 2023 04:25

KeXiangWang changed the title ~~feat(expr): add to_json~~ feat(expr): add to_jsonb Nov 1, 2023

stdrc self-requested a review November 1, 2023 03:08

wangrunji0408 reviewed Nov 1, 2023


wangrunji0408 mentioned this pull request Nov 1, 2023

feat(expr): add jsonb_build_array/object function #13198

Merged

8 tasks

KeXiangWang marked this pull request as ready for review November 2, 2023 05:58

fuyufjh requested a review from wangrunji0408 November 2, 2023 06:15

wangrunji0408 approved these changes Nov 2, 2023


stdrc reviewed Nov 2, 2023


KeXiangWang added 2 commits November 2, 2023 16:09

feat(expr): add to_json

e281a87

refactor using ToJsonb

a0d1b89

KeXiangWang force-pushed the wkx/to_jsonb branch from 3183f3d to a0d1b89 Compare November 2, 2023 22:05

remove timestamptz in time zone for now

b2e145f

KeXiangWang mentioned this pull request Nov 2, 2023

bug(expr): format of float of Jsonb #13224

Open

KeXiangWang force-pushed the wkx/to_jsonb branch from a323671 to 116944e Compare November 2, 2023 23:42

clean

6f67496

KeXiangWang force-pushed the wkx/to_jsonb branch from 116944e to 6f67496 Compare November 2, 2023 23:45

xiangjinwu approved these changes Nov 3, 2023


KeXiangWang added 3 commits November 3, 2023 01:19

int256 -> str & update test

2baa1a8

Merge branch 'main' into wkx/to_jsonb

c59126c

fmt

c2e6447

KeXiangWang enabled auto-merge November 3, 2023 05:50

KeXiangWang added this pull request to the merge queue Nov 3, 2023

KeXiangWang mentioned this pull request Nov 3, 2023

feat(expr): to_jsonb #12834

Closed

Merged via the queue into main with commit 624094c Nov 3, 2023
29 of 30 checks passed

KeXiangWang deleted the wkx/to_jsonb branch November 3, 2023 06:19

neverchanje mentioned this pull request Nov 3, 2023

Document: feat(expr): add to_jsonb risingwavelabs/risingwave-docs#1475

Closed

CharlieSYH mentioned this pull request Nov 6, 2023

Doc to_jsonb() risingwavelabs/risingwave-docs#1478

Merged

2 tasks
