
Reading parquet in Python has some rough edges, making it hard to diagnose problems #2139

Closed
prrao87 opened this issue Oct 3, 2023 · 7 comments
Labels: bug (Something isn't working), high-priority, testing (Testing related issues)

Comments

@prrao87
Member

prrao87 commented Oct 3, 2023

I've upgraded to the latest version 0.0.9 and there are a few rough edges with reading parquet from Python.

Problem scenario

Suppose I have a parquet file with two columns, id and name, of types INT64 and STRING respectively.

During development, we may sometimes specify the wrong data type during table creation, for example:

def create_node_table(conn: Connection) -> None:
    conn.execute(
        """
        CREATE NODE TABLE
            Data(
                id STRING,
                name STRING
            )
        """
    )

In this case, the data type of id should be INT64 but was specified as STRING. The script simply segfaults, making it very hard for the Python developer to know what they did wrong (it's not a Kùzu bug, and it could easily have been fixed if the user had a better error message).

[1]    5208 segmentation fault  python test.py
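Until the loader validates schemas itself, even a user-side pre-check would turn this crash into a readable error. Below is a minimal sketch of that idea, with plain dicts standing in for the parquet footer and the DDL; `check_schema_compatibility` is a hypothetical helper, not part of the Kùzu API.

```python
# Sketch: compare declared table types against the parquet file's actual
# schema and raise a descriptive Python error before any data is copied.
# Schemas are plain dicts here; in practice they would be read from the
# parquet footer and the CREATE NODE TABLE statement.

def check_schema_compatibility(parquet_schema: dict, table_schema: dict) -> None:
    """Raise a descriptive error if declared column types do not match."""
    for column, parquet_type in parquet_schema.items():
        declared = table_schema.get(column)
        if declared is None:
            raise ValueError(f"Column '{column}' missing from table definition")
        if declared != parquet_type:
            raise ValueError(
                f"Type mismatch for column '{column}': "
                f"parquet file has {parquet_type}, table declares {declared}"
            )

parquet_schema = {"id": "INT64", "name": "STRING"}
table_schema = {"id": "STRING", "name": "STRING"}  # the mistaken DDL above
try:
    check_schema_compatibility(parquet_schema, table_schema)
except ValueError as e:
    print(e)  # a clear message instead of a segfault
```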

In the reverse situation, the error is even more confusing. Say I have a parquet file with the same two columns, id and name, but this time the id column is also of type STRING, like the name column.

def create_node_table(conn: Connection) -> None:
    conn.execute(
        """
        CREATE NODE TABLE
            Data(
                id INT64,
                name STRING
            )
        """
    )

This time, if the user mistakenly declares the id column as INT64 (when they should have specified STRING), it doesn't segfault but instead produces this cryptic error:

  File "/code/.venv/lib/python3.11/site-packages/kuzu/connection.py", line 88, in execute
    self._connection.execute(
RuntimeError: ColumnChunk::templateCopyStringArrowArray

What should be a 2-second fix to the data type now becomes a tedious exercise for the Python developer to go back and inspect the data and learn (after some deliberation) that the data type specified in Python was wrong.
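As a stopgap, a user could wrap execute calls to attach a hint to the opaque error. This is a sketch of a user-side workaround, not Kùzu behaviour; `safe_execute` and `fake_execute` are hypothetical names.

```python
# A defensive wrapper a Python user could apply today: it catches the
# opaque RuntimeError and re-raises with a hint about the likely cause.
# safe_execute is a hypothetical helper, not part of the Kùzu API.

def safe_execute(execute_fn, query: str):
    try:
        return execute_fn(query)
    except RuntimeError as e:
        raise RuntimeError(
            f"{e}\nHint: check that the column types declared in your DDL "
            "match the schema of the parquet file being copied."
        ) from e

# Stand-in for conn.execute reproducing the error seen above.
def fake_execute(query: str):
    raise RuntimeError("ColumnChunk::templateCopyStringArrowArray")

try:
    safe_execute(fake_execute, "COPY Data FROM 'data.parquet'")
except RuntimeError as e:
    print(e)
```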

Developer experience focus

These sorts of issues are generic; adding these edge cases to the test suite would help, but it's impossible for the Kùzu team to anticipate all the ways users might break the system.

In this case, malformed data types (or a simple user mistake) are common enough that the test suite can be expanded, but the longer-term problem of propagating C++ errors to Python is still something to ponder. What should be a very simple fix can take minutes of unnecessary developer effort to build even the simplest graphs. This is a big source of user frustration in Python, and it's natural to expect a large portion of future Kùzu users to come from Python, so this is a serious enough issue to discuss. I think it really makes sense to deep-dive into how to present better error messages to Python users. cc @andyfengHKU @semihsalihoglu-uw

@andyfengHKU
Contributor

andyfengHKU commented Oct 3, 2023

Hi @prrao87, thanks for the bug report! I'll give some of my thoughts first.

Data loading bug

First, I must admit that in both scenarios the behaviour is unacceptable. A system should handle malformed input and error out properly. For this specific case, the right behaviour is to strictly follow the DDL and always try to cast the input data according to it. If a cast cannot succeed, e.g. casting a STRING to INT64, a runtime error should be thrown.
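The "cast per DDL, raise on failure" rule can be sketched in a few lines of Python (illustrative names only; the real implementation lives on the C++ side):

```python
# Sketch of "strictly follow DDL" casting: each incoming value is cast to
# the declared type, and an unsatisfiable cast raises a runtime error
# instead of crashing. Names here are illustrative, not Kùzu internals.

CASTERS = {"INT64": int, "DOUBLE": float, "STRING": str}

def cast_per_ddl(value, ddl_type: str):
    try:
        return CASTERS[ddl_type](value)
    except (ValueError, TypeError):
        raise RuntimeError(
            f"Conversion exception: cannot cast '{value}' to {ddl_type}"
        )

print(cast_per_ddl("42", "INT64"))  # a STRING holding a valid integer casts fine
```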

@acquamarin will send out a fix soon.

Developer experience

Yes, this is something we haven't done a good job on. I think the deeper problem is that we don't have a good error message/reporting mechanism on the C++ side, so the Python API doesn't have proper error messages to begin with. We've received multiple pieces of feedback related to this, and I also believe it is something we should address with high priority now. I'll make the following two proposals to the team this week.

  • Integrate a fuzzing testing framework to ensure the system does not crash in any situation. Our current testing framework focuses primarily on the correctness of results and does not test many malformed queries.
  • Rewrite our exceptions to throw proper error message for non-internal exceptions.
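A fuzzing harness along the lines of the first proposal might look like this sketch. The `execute` function is a stub standing in for the real driver, and the invariant being checked is that any input may raise an exception, but none may crash the process.

```python
import random
import string

def execute(query: str) -> str:
    # Stub standing in for conn.execute; a real harness would drive the
    # actual Python binding and watch for process death, not just exceptions.
    if not query.strip():
        raise RuntimeError("Parser exception: empty query")
    if not query.upper().startswith(("CREATE", "MATCH", "COPY", "RETURN")):
        raise RuntimeError(f"Parser exception: unexpected input '{query[:20]}'")
    return "ok"

def fuzz(iterations: int = 200, seed: int = 0) -> int:
    """Throw random byte-soup at execute(); exceptions are acceptable,
    crashes are not. Returns how many inputs raised."""
    rng = random.Random(seed)
    raised = 0
    for _ in range(iterations):
        q = "".join(
            rng.choice(string.printable) for _ in range(rng.randint(0, 60))
        )
        try:
            execute(q)
        except Exception:
            raised += 1  # a clean exception is the acceptable outcome
    return raised

fuzz()
```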

Once we make a decision, we will open an issue describing our roadmap and will keep you updated on the solution.

@semihsalihoglu-uw
Contributor

This is one of the most important things we should focus on. The main problem is the first bullet point: our current testing framework focuses on the correctness of results and rarely tests malformed queries. We need to prioritize test cases for wrong inputs and malformed queries that should throw errors, and as our test cases catch these, we will write the necessary code to improve our error messages.

The first thing to do is for code reviewers to think seriously about malformed queries that should error, and to ask PR submitters to add those tests.
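In test-suite terms, that means table-driven negative tests asserting a clean exception rather than a crash. A minimal framework-free sketch, with a hypothetical `execute` stub in place of a real connection:

```python
# Table-driven negative tests: each malformed query must raise a clean,
# descriptive exception. The execute() stub is hypothetical; in a real
# suite it would be conn.execute against a temporary database.

def execute(query: str):
    if "COPY" in query and "STRING" in query and "INT64" in query:
        raise RuntimeError("Conversion exception: cannot cast STRING to INT64")
    if query.count("(") != query.count(")"):
        raise RuntimeError("Parser exception: unbalanced parentheses")
    return "ok"

MALFORMED = [
    "CREATE NODE TABLE Data(id INT64, name STRING",    # unbalanced parens
    "COPY Data FROM 'x.parquet' -- STRING vs INT64",   # type mismatch
]

def run_negative_tests():
    failures = []
    for q in MALFORMED:
        try:
            execute(q)
            failures.append(q)  # should have raised: record as a failure
        except RuntimeError:
            pass                # clean error is the expected outcome
    return failures

assert run_negative_tests() == []
```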

@prrao87
Member Author

prrao87 commented Oct 3, 2023

Thank you! What you both said makes a lot of sense 👍🏽

@prrao87
Member Author

prrao87 commented Oct 14, 2023

@andyfengHKU and @acquamarin I think the new parquet reader test suite definitely needs to be looked at. Version 0.0.10 produces this bug whose behaviour is different from the previous version.

Looking at things from a larger perspective, most of my parquet-driven workflows involve polars in some shape or form, but I believe the tests on your end assume that the parquet comes from pyarrow? I wonder if it makes sense to increase the diversity of parquet files being tested, taking files from numerous sources like polars, pyarrow, duckdb, etc., so that upcoming versions don't show regressions. Considering that your parquet reader is built from scratch, it's expected to have rough edges, so I hope this request will be considered for the upcoming release!

@semihsalihoglu-uw semihsalihoglu-uw added bug Something isn't working high-priority labels Jan 8, 2024
@semihsalihoglu-uw
Contributor

I assume this issue is still not resolved, right?

@andyfengHKU
Contributor

We now throw an exception if the column type doesn't match the parquet type, though we should go one step further and perform casting for numerical cases.

@andyfengHKU
Contributor

This should be solved with #4041
