Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Plug hole in MDS type system: add arbitrary-precision decimal #390

Merged
merged 7 commits into from
Aug 25, 2023

Conversation

knighton
Copy link
Contributor

Problem

Streaming PR 363 seeks to add delta to MDS conversion to Streaming. This means MDS must have an answer for each of the types in pyspark.sql.types.

Non-solutions

Currently, we don't: arbitrary-precision decimals (and for that matter, ints) are not natively supported. Several workarounds are available using our current type system:

  1. Use pkl encoding, which will correctly encode and decode, but with explosively bad data usage.
  2. Use json encoding to dump the digits as str, then custom decode the field(s) in your subclass of StreamingDataset back to your arbitrary-precision int/float/Decimal type.
  3. Use bytes encoding and roll your own custom encoding /and/ decoding.

Using Pickle will make me look incompetent, you say? How bad can it be?

for ty in [int, float, Decimal]:
    a = ty(0)
    b = pickle.dumps(a)
    c = pickle.loads(b)
    assert a == c
    print(str(type(a)), len(b))
<class 'int'> 5
<class 'float'> 21
<class 'decimal.Decimal'> 42

Oh. That is, if you were expecting serializing 0 to require 4 bytes, pickle is 5x worse for floats and 10x worse for decimals. At scale, that really adds up.

Solution

This PR adds three new MDS types that are small, simple, and general: str_int, str_float, and str_decimal. We take the JSON approach of just dumping digits to str and decoding it back, providing arbitrary precision without any hassle or complexity.

Of course, ASCII digits are not the most efficient way to serialize. To support possible future work, we also include a script analyzing what the most useful configurations of fixed-size binary decimal types would be. The actual fixed-size decimal binary type(s) are not in this PR, as I would like to study the problem a bit more before committing to a certain API for them and that work is off the critical path.

Columns:
- int bits: bits used to hold the digits as a scalar.
- pow bits: bits used to hold the exponent (location of decimal place).
- precision: Digits of precision.
- dec range: Range of decimal places (half left, half right).

4-byte unsigned decimal:
    int bits | pow bits | precision | dec range
          23 |        8 |         6 |       512
          24 |        7 |         7 |       256
          27 |        4 |         8 |        32

4-byte signed decimal:
    int bits | pow bits | precision | dec range
          22 |        8 |         6 |       512
          24 |        6 |         7 |       128

8-byte unsigned decimal:
    int bits | pow bits | precision | dec range
          55 |        8 |        16 |       512
          57 |        6 |        17 |       128

8-byte signed decimal:
    int bits | pow bits | precision | dec range
          54 |        8 |        16 |       512
          57 |        5 |        17 |        64

16-byte unsigned decimal:
    int bits | pow bits | precision | dec range
         119 |        8 |        35 |       512
         120 |        7 |        36 |       256
         123 |        4 |        37 |        32

16-byte signed decimal:
    int bits | pow bits | precision | dec range
         118 |        8 |        35 |       512
         120 |        6 |        36 |       128

@XiaohanZhangCMU
Copy link
Member

Fixed a test failure, otherwise all look legitimate. Please add any comments here. Idea is to merge this PR before 363.
@knighton @karan6181

@karan6181
Copy link
Collaborator

karan6181 commented Aug 24, 2023

Great PR Description

tests/test_encodings.py Outdated Show resolved Hide resolved
tests/test_encodings.py Show resolved Hide resolved
tests/test_encodings.py Outdated Show resolved Hide resolved
@XiaohanZhangCMU XiaohanZhangCMU merged commit a93636f into main Aug 25, 2023
6 checks passed
@XiaohanZhangCMU XiaohanZhangCMU deleted the james/mds-decimal branch August 25, 2023 21:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants