Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

read functions should return insightful errors #342

Open
tlukkezen opened this issue Jul 26, 2023 · 3 comments
Open

read functions should return insightful errors #342

tlukkezen opened this issue Jul 26, 2023 · 3 comments
Labels
bug Something isn't working

Comments

@tlukkezen
Copy link
Collaborator

tlukkezen commented Jul 26, 2023

The read_cpt and read_bore functions have some "automagical" logic that infers the content of the file argument. The user can provide an object of types io.BytesIO | Path | str and with "engine"="auto", the content type is inferred automatically. This can result in confusing errors when erroneous input is provided.

Some examples:

Providing a non-existing path results in XMLSyntaxError

Input:

from pygef import read_cpt
read_cpt(file="non/existing/file.gef")

Response:

lxml.etree.XMLSyntaxError: Start tag expected, '<' not found, line 1, column 1

The expectation is to get a FileNotFoundError

Providing a non-existing path and engine="gef" results in ValueError

Input:

from pygef import read_cpt
read_cpt(file="non/existing/file.gef", engine="gef")

Response:

ValueError: The selected gef file is not a cpt. Check the REPORTCODE or the PROCEDURECODE.

The expectation is to get a FileNotFoundError

Providing an erroneous gef file results in XMLSyntaxError while gef can be parsed when forced

Input:

from pygef import read_cpt
read_cpt(file="path/to/erroneous.GEF")

Response:

lxml.etree.XMLSyntaxError: Start tag expected, '<' not found, line 1, column 1

Input:

from pygef import read_cpt
read_cpt(file="path/to/erroneous.GEF", engine="gef")

Response:

CPTData(bro_id=None, research_report_date=None, ...

The expectation is to get an error that the gef file is invalid, and this response should be consistent no matter the value for engine.

@tlukkezen tlukkezen added the bug Something isn't working label Jul 26, 2023
@tlukkezen
Copy link
Collaborator Author

tlukkezen commented Jul 26, 2023

In the current implementation, we just try to parse the possible contents of file and if it fails we move on to the next. The last parsing option that fails (parsing as bro-xml by default) is the one that provides the error. This is random behaviour and is the reason that the error that is returned doesn't reflect the actual problem.

I see the following resolutions now:

  • Require more content about the file argument (e.g. with a string_content=["path","file"] argument, but that would only have a meaning for file arguments of type str (kinda ugly) (Breaking change)

  • As pandas.read_csv() does: Accept str | io.StringIO types for file and always infer the str type as a Path, and the io.StringIO as file content. This would require the user to convert a string to StringIO object. (Breaking change)

  • Return a general custom exception (e.g. CPTParsingError) when all parsing options have failed. Although this will still return a ValueError for an erroneous gef-file path.

@RDWimmers
Copy link
Member

RDWimmers commented Jul 26, 2023

I see two options now, both breaking changes:

  • Require more content about the file argument (e.g. with a string_content=["path","file"] argument, but that would only have a meaning for file arguments of type str (kinda ugly)
  • As pandas.read_csv() does: Accept str | io.StringIO types for file and always infer the str type as a Path, and the io.StringIO as file content. This would require the user to convert a string to StringIO object

I like the second one. we can set the type of filepath_or_buffer to Path | io.StringIO. then we dont use strings and can use the Path.is_file() method.

@tversteeg
Copy link
Member

When a string has the form of a path separated by slashes or even a single short word it can never be a valid XML or GEF file right? All XML files need to start with a < character, so that's easy, and all GEF files have the form of key: value, so I think a proper heuristic would be:

Is almost certainly path if:

  1. Is the length <= 255 characters
  2. Does it end with case-insensitive .gef or .xml
  3. Does it not contain any < or : characters
  4. Can it be opened from disk

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants