Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

connector: support protobuf map type in source #13387

Closed
Tracked by #10479 ...
BugenZhao opened this issue Nov 13, 2023 · 20 comments · Fixed by #18512
Closed
Tracked by #10479 ...

connector: support protobuf map type in source #13387

BugenZhao opened this issue Nov 13, 2023 · 20 comments · Fixed by #18512

Comments

@BugenZhao
Copy link
Member

BugenZhao commented Nov 13, 2023

Prior to #12126, a map field is accepted and map to struct<key, value>[] in RisingWave, which might not be expected as suggested by @neverchanje so disabled there.

JSONB might be another good candidate, which supports efficient indexing.

@BugenZhao BugenZhao changed the title support protobuf map type connector: support protobuf map type Nov 13, 2023
@github-actions github-actions bot added this to the release-1.5 milestone Nov 13, 2023
@BugenZhao BugenZhao changed the title connector: support protobuf map type connector: support protobuf map type of source Nov 13, 2023
@BugenZhao BugenZhao changed the title connector: support protobuf map type of source connector: support protobuf map type in source Nov 13, 2023
@BugenZhao BugenZhao added type/enhancement Improvements to existing implementation. type/feature component/connector and removed type/enhancement Improvements to existing implementation. labels Nov 13, 2023
@BugenZhao

This comment was marked as resolved.

@xiangjinwu
Copy link
Contributor

xiangjinwu commented Nov 13, 2023

https://protobuf.dev/programming-guides/proto3/#maps
I would like to list 4 candidate solutions for map<K, V>:

  • (A) struct<key K, value V>[]
  • (B) jsonb
  • (C) hstore
  • (D) map<K, V>

I will focus on the downside of each approach:

  • struct[] does not support efficient lookup natively, as discussed above. One possible workaround is to sort by key during ingestion, and provide a dedicated function/operator to do binary search.
  • jsonb means everything inside would also be jsonb rather than strongly typed. Proto allows K to be integral types but jsonb requires keys to be string. Proto requires V to be the same for all kv pairs but jsonb allows some to be jsonb number and some to be jsonb string. In addition jsonb does not have good support of types likes int64 / bytea / timestamptz.
  • hstore is the official PostgreSQL extension for kv pairs. However it is not generic and requires both K and V to be string.
  • map<K, V> means we add a new dedicated type, because none of the options above maps perfectly. This takes more time to implement, but has the best compatibility with similar types in protobuf, avro, arrow, DuckDB, Flink, ...

@BugenZhao
Copy link
Member Author

I would like to list 4 candidate solutions for map<K, V>:

Thanks for your detailed exploration! Personally, I would stand for struct[] or map<K, V>. Some ideas:

  • It's inevitable to introduce a (series of?) indexing function, no matter which type it is.
  • Assuming the maps ingested from external sources are not likely to be mutated later, simply sorting all entries in the struct[] and mark the sorted flag will enable efficient indexing.

@tabVersion

This comment was marked as resolved.

@fuyufjh
Copy link
Member

fuyufjh commented Nov 14, 2023

-1 for struct<key K, value V>[] because it lost the semantics of map.

map is built for mapping i.e. looking up by a key to get a value. It's important to make data type "lossless", but it's more important to keep it as original semantics.

Here, I prefer jsonb. Obviously, the idea map in protobuf is brought from semi-structured format, while in PG, semi-structure are mostly handled by JSONB

@BugenZhao
Copy link
Member Author

It's okay to make it jsonb if we find most of the use cases are map<string, string>. However, it's also possible for the value to be a complex message. In this case we need another set of type mappings from protobuf to JSON, in additional to the original protobuf to RisingWave data type.

@fuyufjh
Copy link
Member

fuyufjh commented Nov 14, 2023

In this case we need another set of type mappings from protobuf to JSON, in additional to the original protobuf to RisingWave data type.

It's inevitable since we have already supported Any.

@fuyufjh
Copy link
Member

fuyufjh commented Nov 14, 2023

Thinking from the users' perspective, if we do really use struct<key K, value V>[] for map, at least we need to provide functions to

  • lookup value by key
  • check whether a key exist
  • update or delete a field

Adding up together, I would rather introduce a new map data type in RW.

@xiangjinwu
Copy link
Contributor

xiangjinwu commented Nov 14, 2023

It's inevitable to introduce a (series of?) indexing function, no matter which type it is.

Putting development time aside, a native map<K, V> type is the best solution. So these additional struct[] functions could support both struct[] and map<K, V>.

As for jsonb, my biggest concern is that it makes all inner content jsonb, losing "the semantics of" type definition as well. For example, given a proto map<string, Location> mapped to jsonb, we would need to write col1 -> 'SFO' ->> 'latitude' rather than (lookup(col1, 'SFO')).latitude.

we need another set of type mappings from protobuf to JSON

This is not a big deal. Proto3 has an official definition of how proto maps to JSON. But do note that int64 maps to string.

@fuyufjh
Copy link
Member

fuyufjh commented Nov 14, 2023

As for jsonb, my biggest concern is that it makes all inner content jsonb, losing "the semantics of" type definition as well. For example, given a proto map<string, Location> mapped to jsonb, we would need to write col1 -> 'SFO' ->> 'latitude' rather than (lookup(col1, 'SFO')).latitude.

Yeah, I understand this, and I guess it's also the major concern of all of us. For me, because Any has already broken this, I feel acceptable.

@xiangjinwu
Copy link
Contributor

xiangjinwu commented Nov 14, 2023

google.protobuf.Any is intentionally a dynamic type, but map could be a static type. But yes, sounds more acceptable before we have a dedicated map type.

Btw do we use the official proto3 json mapping in google.protobuf.Any? Is an int64 in any message represented by json string or json number?

@tabVersion

This comment was marked as resolved.

@fuyufjh
Copy link
Member

fuyufjh commented Nov 14, 2023

Btw do we use the official proto3 json mapping in google.protobuf.Any? Is an int64 in any message represented by json string or json number?

IIRC, Any type can only point to a message rather than primitive types.

I guess @xiangjinwu means an int64 inside a message in an Any

@BugenZhao
Copy link
Member Author

BugenZhao commented Nov 14, 2023

My concernings for a native map type is:

  • Introduction of the data type itself is simple. However, there seems to be a considerable amount of work in the parser and expressions to make it first-class, just like other semi-structured types.
  • We need to carefully tune the performance, apply the knowledge on the JSONB and the List type once again.
  • Such challenges would we have to face, the Protobuf seems to be the only usage. So the test coverage could be limited as well.

Thinking from the users' perspective, if we do really use struct<key K, value V>[] for map, at least we need to provide functions to

  • lookup value by key

  • check whether a key exist

  • update or delete a field

Adding up together, I would rather introduce a new map data type in RW.

As above, it might not be that simple...

@xiangjinwu
Copy link
Contributor

Had another offline discussion with @fuyufjh and we aligned on these:

  • hstore is out because it is only nonnull string -> nullable string.
  • jsonb may be sufficient to cover common cases. It is a popular fallback when different system have different support of complex types (array, struct, map, enum, union, ...)
    • Further simplification: value can only be a base type (types except array/struct).
    • There will be two styles: int64 as number (postgres) and int64 as string (protobuf). The former is more friendly for further processing in RisingWave. Support of the latter can be added if requested to be strongly protobuf-compatible.
  • map may be an overkill unless we get frequent complaints about the jsonb solution above. By that time we would skip struct[] workaround and do map directly.

@fuyufjh
Copy link
Member

fuyufjh commented Jan 10, 2024

the Protobuf seems to be the only usage.

So we have a new usage now #14215 😄

@fuyufjh
Copy link
Member

fuyufjh commented Jan 10, 2024

I just realized that many systems introduced dedicated map type, including

Perhaps Postgres is the special one... Shall we add support for map as well?

@xiangjinwu
Copy link
Contributor

So we have a new usage now #14215 😄

Concern of using jsonb in that issue for kafka headers is the missing of native bytea/bytes support. Or actually there are multiple ways to represent 8 bytes[0x61 0x30 0x00 0x20 0xf0 0x9f 0x8c 0x8a] in json as a string:

  • 12-byte "YTAAIPCfjIo=" (base64)
  • 16-byte "61300020f09f8c8a" (hex)
  • 18-byte "\\x61300020f09f8c8a" (hex with prefix \x, pgwire text format)
  • 23-byte "a0\\000 \\360\\237\\214\\212" (escape) or 11-byte "a0\\000 \ud83c\udf0a" or 11-byte "a0\\000 🌊"

It is also easy to convert from these forms into a bytea in RisingWave today:

with t(headers) as (select '{
  "a": "YTAAIPCfjIo=",
  "b": "61300020f09f8c8a",
  "c": "\\x61300020f09f8c8a",
  "d0": "a0\\000 \\360\\237\\214\\212",
  "d1": "a0\\000 \ud83c\udf0a",
  "d2": "a0\\000 🌊"
}'::jsonb) select
  octet_length(headers ->> 'a'), decode(headers ->> 'a', 'base64'),
  octet_length(headers ->> 'b'), decode(headers ->> 'b', 'hex'),
  octet_length(headers ->> 'c'), (headers ->> 'c')::bytea,            -- this format can just use cast
  octet_length(headers ->> 'd0'), decode(headers ->> 'd0', 'escape'), -- this can both decode or cast
  octet_length(headers ->> 'd0'), (headers ->> 'd0')::bytea,
  octet_length(headers ->> 'd1'), decode(headers ->> 'd1', 'escape'),
  octet_length(headers ->> 'd2'), (headers ->> 'd2')::bytea
from t;

My thoughts:

  • Yes map is popular in many systems
  • jsonb can also work as kafka headers to include bytes before map is ready, but it is easier to stick to one encoding rather than exposing an option to choose.
  • If we anticipate users would always decode to bytea first, base64 is the most compact.
  • If we anticipate users would mostly include ascii string, escape is more human readable without decoding. And it can be decoded either using a function or a cast.
  • If we anticipate users would like to inspect (but not pass as function arg, which requires decoding) hex without decoding, for example integers or raw-uuid or raw-ipv6, hex shall be chosen.

@BugenZhao BugenZhao added the needs-design Don't start your coding work before a detailed design proposed label Mar 6, 2024
@BugenZhao BugenZhao removed this from the release-1.7 milestone Mar 6, 2024
@fuyufjh fuyufjh added the help wanted Issues that need help from contributors label May 17, 2024
@xxchan xxchan self-assigned this May 22, 2024
@xxchan
Copy link
Member

xxchan commented May 28, 2024

we need another set of type mappings from protobuf to JSON

This is not a big deal. Proto3 has an official definition of how proto maps to JSON. But do note that int64 maps to string.

To elaborate this point a bit: it's actually a big deal, and another big drawback of JSON -- It lacks native support for a lot of types, and thus leaving there many choices of encoding to JSON and decoding from JSON, which means we need to provide a lot of config options.

The encoding problem include:

  • int64: Actually JSON spec doesn't accept it.
    • Some JSON libraries (serde_json, jsonbb) allow it, but some don't
    • protobuf encodes to string. RisingWave sink encodes to number (so when it overflows, it will be undefined behavior, decided by the downstream system).
  • bytes: As mentioned by xiangjin above, there are many possible ways.

Different cases include:


My opinion for the last question is that we don't provide encoding option for this case, and just choose our opinionated one.

Oh, it seems xiangjin already talked about this:

jsonb can also work as kafka headers to include bytes before map is ready, but it is easier to stick to one encoding rather than exposing an option to choose.

Copy link
Contributor

github-actions bot commented Aug 1, 2024

This issue has been open for 60 days with no activity.

If you think it is still relevant today, and needs to be done in the near future, you can comment to update the status, or just manually remove the no-issue-activity label.

You can also confidently close this issue as not planned to keep our backlog clean.
Don't worry if you think the issue is still valuable to continue in the future.
It's searchable and can be reopened when it's time. 😄

@xxchan xxchan removed no-issue-activity help wanted Issues that need help from contributors needs-design Don't start your coding work before a detailed design proposed labels Aug 1, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants