add piping format from/to shell copy commands #64

leo-schick · 2022-09-26T11:31:30Z

mara_db/shell.py

mara_db/format.py

jankatins

I would try to make this a bit cleaner: Given that you want to use Format from this onwards and the old way should get deprecated, I would:

Add a "NativeFormat" which takes no options
Make all users error if format and any of the old ones are given
If the old ones are given: create a corresponding Format (maybe using some helper in format.py)
Handle the new Format way in the actual commandline generation

That way at some point, we can add a warn into the old way and at some point remove it completely...

mara_db/shell.py

mara_db/format.py

leo-schick · 2022-09-29T14:30:41Z

@jankatins You are right about what you write. But this might lead into breaking changes.

I was quite cautious with the changes I did to not make any breaking change. That resulted (unfortunately) into ugly code. e.g.

# ...
    elif isinstance(pipe_format, CsvFormat):
        csv_format = True
        if delimiter_char is None:
# ...

The purpose of this was to make it work that when someone uses the old and new logic combined. For example: someone upgrades their mara_db package and uses the copy command and patches manually the copy_to_stdout_command method with some custom logic (e.g. to use https://github.com/kfzteile24/sqpy). I feared that this would break at their side then.

There are already so many copy_command implementations which - for me - are hard to understand since they use often the default parameter (header=None, csv_format=None etc.) which is pretty hard to reingeneer since other databases use other defaults when passing None.
E.g. the default delimiter_char for postgres is \t, for SQL Server it is ,, but for MySQL it is not even defined and would through an ValueError exception when you use it.
A proper implementation for delimiter_char for MySQL would have been that (lets say the default for MySQL is ,) the method raises the ValueError only when the delimiter_char is not ,. That is the way SnowflakeDB is implemented.

I would like to do these changes mentioned above, but a) maybe in the next major release where we officially discontinue or drop the old parameters / or b) we implement unit tests for each database engines - especially for the copy_command method to make sure it does not break somewhere.

Another option would be to put this change from the start into a newer major version which is incompatible with older versions of mara-pipelines. This is something I tried to avoid.

jankatins · 2022-09-30T14:54:48Z

The purpose of this was to make it work that when someone uses the old and new logic combined. For example: someone upgrades their mara_db package and uses the copy command and patches manually the copy_to_stdout_command method with some custom logic (e.g. to use https://github.com/kfzteile24/sqpy). I feared that this would break at their side then.

From my point of view, this looks like a lot of complexity you are introducing for an edge case I would say should be taken care by a changelog entry: "If you do this ..., you have to change it to ...". Which points to a major release, as it is a breaking change.

I know that we used a custom athena integration: create min cli command which spits out data to stdout, create a new mara DB class, create and register the copy commands for this class, done... These will break, because of the additional pipe_format argument in copy_command(...., pipe_format: Format = None). So anyone tinkering with that API already needs to fix their code.

leo-schick · 2022-10-01T08:49:44Z

hmm... yea, you're right. I will redesign this implementation ...

leo-schick · 2023-01-27T11:42:49Z

@jankatins I redesigned the solution now. Please reivew.

* comma deimiliter * no header * no footer as it is defined by default in RFC-4180 quote_char is by default None

mara_db/shell.py

leo-schick · 2023-01-29T12:45:33Z

@jankatins You wrote:

I know that we used a custom athena integration: create min cli command which spits out data to stdout, create a new mara DB class, create and register the copy commands for this class, done...

Why not creating a PR which adds Athena support for the mara_db package? Then we can make sure that the code works even with the upcoming breaking change. Or is there something private in it you don’t wanna share?

leo-schick force-pushed the issue-56 branch 2 times, most recently from 757cfb5 to 73a270f Compare September 26, 2022 11:47

leo-schick requested a review from jankatins September 28, 2022 10:56

leo-schick mentioned this pull request Sep 29, 2022

add WriteFile command mara/mara-pipelines#89

Merged