Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CsvEnumerator .size can be wrong with quoted embedded newlines in a cell #7

Open
jjowdy opened this issue Apr 28, 2024 · 2 comments
Open

Comments

@jjowdy
Copy link

jjowdy commented Apr 28, 2024

There's some great stuff here! I noticed one thing that might be a problem somewhere, but probably isn't in practice:

If a (quoted) CSV cell contains embedded newlines—which unfortunately can happen, especially with web form input or something like option-return in Google Sheets—then using wc naively as in this line will return a count larger than the number of rows:

       count = `wc -l < #{filepath}`.strip.to_i

It may not matter much, but it does mean that the enumerator's .size will be wrong.

Unfortunately, I don't think in general there's a good way to deal with this other than doing a first pass with the same CSV library (or equivalent parsing logic) to get the row count. But as long as size is at least as large as the number of rows, it could just be reasonable to state as a known limitation.

@mperham
Copy link

mperham commented Jun 4, 2024

I noticed the array_enumerator method has the same issue, the size will be wrong if any elements are dropped by the cursor because it references the original array and not the pruned one.

- array.each_with_index.drop(cursor || 0).to_enum { array.size }
+ x = array.each_with_index.drop(cursor || 0)
+ x.to_enum { x.size }

@fatkodima
Copy link
Owner

I noticed the array_enumerator method has the same issue, the size will be wrong if any elements are dropped by the cursor because it references the original array and not the pruned one.

I basically copied the implementation from the job-iteration. It is done this way for all the enumerators. I think this was done on purpose to get the size of the whole iterable collection. Since we kinda implement and return our own enumerators, we should return the size of the enumerated collection, like ruby built it enumerators do, not depending on which position in it we are now) and it also allows to track progress by comparing current cursor to enum.size). https://github.com/Shopify/job-iteration/blob/10787b962a33ea26c2b75f6eaffd6f773eb06937/test/unit/csv_enumerator_test.rb#L47-L50

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants