
Reading parquet in Python has some rough edges, making it hard to diagnose problems #2139

Closed
prrao87 opened this issue Oct 3, 2023 · 7 comments
Labels: bug (Something isn't working), high-priority, testing (Testing related issues)

Comments

@prrao87
Member

prrao87 commented Oct 3, 2023

I've upgraded to the latest version 0.0.9 and there are a few rough edges with reading parquet from Python.

Problem scenario

Suppose I have a parquet file with two columns, id and name, of types INT64 and STRING respectively.

During development, we may sometimes specify the wrong data type during table creation, for example:

def create_node_table(conn: Connection) -> None:
    conn.execute(
        """
        CREATE NODE TABLE
            Data(
                id STRING,
                name STRING
            )
        """
    )

In this case, the data type of id should be INT64 but was specified as STRING. The script simply segfaults, making it very hard for the Python developer to know what they did wrong (it's not a Kùzu bug, and it could easily have been fixed if the user had a better error message).

[1]    5208 segmentation fault  python test.py
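Until the loader validates schemas itself, even a user-side pre-check would turn this crash into a readable error. Below is a minimal sketch of that idea, with plain dicts standing in for the parquet footer and the DDL; `check_schema_compatibility` is a hypothetical helper, not part of the Kùzu API.

```python
# Sketch: compare declared table types against the parquet file's actual
# schema and raise a descriptive Python error before any data is copied.
# Schemas are plain dicts here; in practice they would be read from the
# parquet footer and the CREATE NODE TABLE statement.

def check_schema_compatibility(parquet_schema: dict, table_schema: dict) -> None:
    """Raise a descriptive error if declared column types do not match."""
    for column, parquet_type in parquet_schema.items():
        declared = table_schema.get(column)
        if declared is None:
            raise ValueError(f"Column '{column}' missing from table definition")
        if declared != parquet_type:
            raise ValueError(
                f"Type mismatch for column '{column}': "
                f"parquet file has {parquet_type}, table declares {declared}"
            )

parquet_schema = {"id": "INT64", "name": "STRING"}
table_schema = {"id": "STRING", "name": "STRING"}  # the mistaken DDL above
try:
    check_schema_compatibility(parquet_schema, table_schema)
except ValueError as e:
    print(e)  # a clear message instead of a segfault
```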

In the reverse situation, the error is even more confusing. Say I have a parquet file with the same two columns, id and name, but this time the id column is also of type STRING, like the name column.

def create_node_table(conn: Connection) -> None:
    conn.execute(
        """
        CREATE NODE TABLE
            Data(
                id INT64,
                name STRING
            )
        """
    )

This time, if the user mistakenly declares the id column as INT64 (when they should have specified STRING), it doesn't segfault but instead produces this cryptic error:

  File "/code/.venv/lib/python3.11/site-packages/kuzu/connection.py", line 88, in execute
    self._connection.execute(
RuntimeError: ColumnChunk::templateCopyStringArrowArray

What should be a 2-second fix to the data type now becomes a tedious exercise for the Python developer to go back and inspect the data and learn (after some deliberation) that the data type specified in Python was wrong.
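As a stopgap, a user could wrap execute calls to attach a hint to the opaque error. This is a sketch of a user-side workaround, not Kùzu behaviour; `safe_execute` and `fake_execute` are hypothetical names.

```python
# A defensive wrapper a Python user could apply today: it catches the
# opaque RuntimeError and re-raises with a hint about the likely cause.
# safe_execute is a hypothetical helper, not part of the Kùzu API.

def safe_execute(execute_fn, query: str):
    try:
        return execute_fn(query)
    except RuntimeError as e:
        raise RuntimeError(
            f"{e}\nHint: check that the column types declared in your DDL "
            "match the schema of the parquet file being copied."
        ) from e

# Stand-in for conn.execute reproducing the error seen above.
def fake_execute(query: str):
    raise RuntimeError("ColumnChunk::templateCopyStringArrowArray")

try:
    safe_execute(fake_execute, "COPY Data FROM 'data.parquet'")
except RuntimeError as e:
    print(e)
```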

Developer experience focus

These sorts of issues are generic; adding these edge cases to the test suite would help, but it's impossible for the Kùzu team to anticipate all the ways users might break the system.

In this case, malformed data types (or a simple user mistake) are common enough that the test suite can be expanded, but the longer-term problem of propagating C++ errors to Python is still something to ponder. What should be a very simple fix can take minutes of unnecessary developer effort to build even the simplest graphs. This is a big source of user frustration in Python, and it's natural to expect a large portion of future Kùzu users to come from Python, so this is a serious enough issue to discuss. I think it really makes sense to deep-dive into how to present better error messages to Python users. cc @andyfengHKU @semihsalihoglu-uw

@andyfengHKU
Contributor

andyfengHKU commented Oct 3, 2023

Hi @prrao87, thanks for the bug report! I'll give some of my thoughts first.

Data loading bug

First, I must admit that in both scenarios the behaviour is unacceptable. A system should handle malformed input and error out properly. For this specific case, the right behaviour is to strictly follow the DDL and always try to cast the input data according to it. If a cast cannot succeed, e.g. casting a STRING to INT64, a runtime error should be thrown.
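The "cast per DDL, raise on failure" rule can be sketched in a few lines of Python (illustrative names only; the real implementation lives on the C++ side):

```python
# Sketch of "strictly follow DDL" casting: each incoming value is cast to
# the declared type, and an unsatisfiable cast raises a runtime error
# instead of crashing. Names here are illustrative, not Kùzu internals.

CASTERS = {"INT64": int, "DOUBLE": float, "STRING": str}

def cast_per_ddl(value, ddl_type: str):
    try:
        return CASTERS[ddl_type](value)
    except (ValueError, TypeError):
        raise RuntimeError(
            f"Conversion exception: cannot cast '{value}' to {ddl_type}"
        )

print(cast_per_ddl("42", "INT64"))  # a STRING holding a valid integer casts fine
```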

@acquamarin will send out a fix soon.

Developer experience

Yes, this is something we haven't done a good job on. I think the deeper problem is that we don't have a good error message/reporting mechanism on the C++ side, so the Python API doesn't have proper error messages to begin with. We've received multiple pieces of feedback related to this, and I also believe it is something we should address with high priority now. I'll make the following two proposals to the team this week.

  • Integrate a fuzzing testing framework to ensure the system does not crash in any situation. Our current testing framework focuses primarily on the correctness of results and does not test many malformed queries.
  • Rewrite our exceptions to throw proper error message for non-internal exceptions.
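A fuzzing harness along the lines of the first proposal might look like this sketch. The `execute` function is a stub standing in for the real driver, and the invariant being checked is that any input may raise an exception, but none may crash the process.

```python
import random
import string

def execute(query: str) -> str:
    # Stub standing in for conn.execute; a real harness would drive the
    # actual Python binding and watch for process death, not just exceptions.
    if not query.strip():
        raise RuntimeError("Parser exception: empty query")
    if not query.upper().startswith(("CREATE", "MATCH", "COPY", "RETURN")):
        raise RuntimeError(f"Parser exception: unexpected input '{query[:20]}'")
    return "ok"

def fuzz(iterations: int = 200, seed: int = 0) -> int:
    """Throw random byte-soup at execute(); exceptions are acceptable,
    crashes are not. Returns how many inputs raised."""
    rng = random.Random(seed)
    raised = 0
    for _ in range(iterations):
        q = "".join(
            rng.choice(string.printable) for _ in range(rng.randint(0, 60))
        )
        try:
            execute(q)
        except Exception:
            raised += 1  # a clean exception is the acceptable outcome
    return raised

fuzz()
```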

Once we make a decision, we will open an issue describing our roadmap and will keep you updated on the solution.

@semihsalihoglu-uw
Contributor

This is one of the most important things we should focus on. The main problem is the first bullet point: our current testing framework focuses on the correctness of results and rarely tests malformed queries. We need to prioritize test cases for wrong inputs and malformed queries that should throw errors, and as our test cases catch these, we will write the necessary code to improve our error messages.

The first thing to do is for code reviewers to think seriously about malformed queries that should error, and to ask PR submitters to add those tests.
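In test-suite terms, that means table-driven negative tests asserting a clean exception rather than a crash. A minimal framework-free sketch, with a hypothetical `execute` stub in place of a real connection:

```python
# Table-driven negative tests: each malformed query must raise a clean,
# descriptive exception. The execute() stub is hypothetical; in a real
# suite it would be conn.execute against a temporary database.

def execute(query: str):
    if "COPY" in query and "STRING" in query and "INT64" in query:
        raise RuntimeError("Conversion exception: cannot cast STRING to INT64")
    if query.count("(") != query.count(")"):
        raise RuntimeError("Parser exception: unbalanced parentheses")
    return "ok"

MALFORMED = [
    "CREATE NODE TABLE Data(id INT64, name STRING",    # unbalanced parens
    "COPY Data FROM 'x.parquet' -- STRING vs INT64",   # type mismatch
]

def run_negative_tests():
    failures = []
    for q in MALFORMED:
        try:
            execute(q)
            failures.append(q)  # should have raised: record as a failure
        except RuntimeError:
            pass                # clean error is the expected outcome
    return failures

assert run_negative_tests() == []
```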

@prrao87
Member Author

prrao87 commented Oct 3, 2023

Thank you! What you both said makes a lot of sense 👍🏽

@prrao87
Member Author

prrao87 commented Oct 14, 2023

@andyfengHKU and @acquamarin I think the new parquet reader test suite definitely needs to be looked at. Version 0.0.10 produces this bug whose behaviour is different from the previous version.

Looking at things from a larger perspective, most of my parquet-driven workflows involve polars in some shape or form, but I believe the tests on your end assume that the parquet comes from pyarrow? I wonder if it makes sense to increase the diversity of parquet files being tested, taking files from numerous sources like polars, pyarrow, duckdb, etc., so that upcoming versions don't show regressions. Considering that your parquet reader is built from scratch, it's expected to have rough edges, so I hope this request will be considered for the upcoming release!

@semihsalihoglu-uw semihsalihoglu-uw added bug Something isn't working high-priority labels Jan 8, 2024
@semihsalihoglu-uw
Contributor

I assume this issue is still not resolved, right?

@andyfengHKU
Contributor

We now throw an exception if the column type doesn't match the parquet type, though we should go one step further and perform casting for numerical cases.

@andyfengHKU
Contributor

This should be solved with #4041
