MLCP skips bad records without reporting when using splits #147

Open
eurochriskelly opened this issue Jun 26, 2020 · 1 comment

eurochriskelly commented Jun 26, 2020

I noticed this recently and worked around it by removing bad rows (those whose column count does not match the header's column count) during pre-processing. However, that workaround does not cover other kinds of bad rows.
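For reference, the pre-processing workaround can be sketched roughly as below. This is only an illustration of the column-count check described above, not the exact script used; the function name `split_good_and_bad` is made up for this example.

```python
import csv
import io


def split_good_and_bad(lines):
    """Return (header, good_rows, bad_rows) from an iterable of CSV lines.

    A row counts as "bad" when its column count differs from the header's.
    Other kinds of bad rows are NOT detected by this check.
    """
    reader = csv.reader(lines)
    header = next(reader)
    good, bad = [], []
    for row in reader:
        if len(row) == len(header):
            good.append(row)
        else:
            bad.append(row)
    return header, good, bad


# Small in-memory sample in the same shape as the repro data.
sample = io.StringIO("H1,H2\na,b\nc,d\nd,e,f\ng,h,\n")
header, good, bad = split_good_and_bad(sample)
```

Only the `good` rows would then be written back out and handed to mlcp; the `bad` rows can be logged separately so nothing goes missing.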

Summary

When mlcp is run with splits enabled on the command line, it can silently lose records, depending on the split size.

This was observed while ingesting several large files (~1M records each) that contained a small percentage of bad records, using various split sizes. The number of unaccounted-for records goes up and down with the split size, depending on whether a split boundary lands on a bad record.

Repro

Generate a large CSV file that includes randomly placed broken rows, like this:

H1,H2
a,b
c,d
d,e,f   # column count mismatch
g,h,
etc..

Note: longer bad rows are better for reproducing the issue.
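A test file of this shape could be generated with something like the following. This is a minimal sketch, not the script originally used: the function name, the `bad_fraction` and `seed` parameters, and the exact layout of the bad rows are all assumptions chosen for illustration.

```python
import csv
import random


def write_broken_csv(path, n_rows, bad_fraction=0.01, seed=0):
    """Write a two-column CSV with randomly placed broken rows.

    Returns the number of broken rows written, so the expected skip
    count is known before the file is ingested.
    """
    rng = random.Random(seed)  # seeded so the file is reproducible
    n_bad = 0
    with open(path, "w", newline="") as f:
        writer = csv.writer(f, lineterminator="\n")
        writer.writerow(["H1", "H2"])
        for i in range(n_rows):
            if rng.random() < bad_fraction:
                # Deliberately long bad row: 11 columns instead of 2.
                writer.writerow([f"bad{i}"] + ["x" * 20] * 10)
                n_bad += 1
            else:
                writer.writerow([f"a{i}", f"b{i}"])
    return n_bad
```

For example, `write_broken_csv("broken.csv", 1_000_000)` writes roughly 1% bad rows and returns the exact count to compare against what mlcp reports as skipped.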

If a split boundary falls on a broken row, that row is lost without being reported.
Changing the split size changes the number of rows that are lost without being reported.
Without the split option, the bad rows are still skipped, but they are reported and every row is accounted for.

As a result, the ingested and skipped totals in the mlcp log do not add up to the actual number of records in the file. The load can appear fully successful because the dropped rows are never reported.
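The mismatch can be checked with a small reconciliation like the one below. It is a sketch only: the ingested and skipped figures have to be read from the mlcp log by hand (no log format is assumed here), and the function names and the numbers in the example are hypothetical.

```python
import csv
import io


def count_data_rows(lines):
    """Count non-empty records after the header in an iterable of CSV lines."""
    reader = csv.reader(lines)
    next(reader, None)  # skip the header row
    return sum(1 for row in reader if row)


def unaccounted(total_rows, ingested, reported_skipped):
    """Rows that were neither ingested nor reported as skipped."""
    return total_rows - ingested - reported_skipped


# Hypothetical figures: 4 data rows in the file, while the log shows
# 2 ingested and only 1 skipped, leaving 1 row unaccounted for.
sample = io.StringIO("H1,H2\na,b\nc,d\nd,e,f\ng,h,\n")
total = count_data_rows(sample)
missing = unaccounted(total, ingested=2, reported_skipped=1)
```

For a real file, pass `open(path, newline="")` to `count_data_rows`. A non-zero `missing` with splits enabled, and zero without, is the behaviour described in this issue.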

This has been tested with several recent versions of mlcp.

@jmakeig jmakeig added the bug label Jun 26, 2020
@yunzvanessa yunzvanessa added this to the 10.0.6 milestone Sep 12, 2020
@yunzvanessa yunzvanessa modified the milestones: 10.0.6, 10.0.7 Jan 28, 2021
@yunzvanessa yunzvanessa modified the milestones: 10.0.7, 10.0.8 May 22, 2021
@yunzvanessa yunzvanessa assigned abika5 and unassigned yunzvanessa Sep 27, 2021
@yunzvanessa yunzvanessa modified the milestones: 10.0.8, 10.0.9 Sep 27, 2021
@yunzvanessa yunzvanessa added verify and removed new labels Oct 18, 2021
@yunzvanessa
Contributor

Hi eurochriskelly,

Thank you for filing this issue! Could you provide sample data so that we can reproduce the bug?

Thanks,
Vanessa

@abika5 abika5 modified the milestones: 10.0.9, 10.0-10 Jan 28, 2022
@yunzvanessa yunzvanessa modified the milestones: 11.0.0, 11.1.0 May 15, 2023
@abika5 abika5 modified the milestones: 11.1.0, 11.2.0 Jan 3, 2024
@abika5 abika5 modified the milestones: 11.3.0, 11.4.0 Jun 20, 2024