Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: native CSV parsing #4073

Merged
merged 6 commits into from
Aug 11, 2022
Merged

feat: native CSV parsing #4073

merged 6 commits into from
Aug 11, 2022

Conversation

HollowMan6
Copy link
Contributor

@HollowMan6 HollowMan6 commented Jul 27, 2022

Previous reviews: HollowMan6#1

A project of Reinforcement Learning Open Source Fest 2022.

Tutorial CSV

Iris.csv
Iris.txt
iris2.csv

Build CSV parser with cmake option -DBUILD_CSV=On

vw -d Iris.txt
vw --csv -d Iris.csv
vw --csv -d Iris2.csv  --named_labels Setosa,Versicolor,Virginica --oaa 3

Summary of features supported and tested:

  1. Allows specifying the CSV field separator by --csv_separator, default is ,, but " | or : are reserved and not allowed to use, since the double quote (") is for escape, vertical bar(|) for separating the namespace and feature names, : can be used in labels.
  2. For each separated field, auto remove the outer double-quotes of a cell when it pairs. --csv_separator symbols that appeared inside the double-quoted cells are not considered as a separator but a normal string character.
  3. Double-quotes that appear at the start and end of the cell will be considered to enclose fields. Other quotes that appear elsewhere and out of the enclose fields will have no special meaning. (This is also how Microsoft Excel parses.)
  4. If double-quotes are used to enclose fields, then a double-quote appearing inside a field must be escaped by preceding it with another double quote, and will remove that escape symbol during parsing.
  5. Use header line for feature names (and possibly namespaces) / specify label and tag using _label and _tag by default. For each separated field in header except for tag and label, it may contain namespace and feature name separated by namespace separator, vertical bar(|).
  6. --csv_header to override the CSV header by providing (namespace, | and) feature name separated with ,. By default, CSV files are assumed to have a header with feature and/or namespaces names in the CSV first line. You can override it by specifying --csv_header. Combined with --csv_no_file_header, we assume that there is no header in the CSV file and under such condition specifying --csv_header for the header is a must.
  7. If the number of the separated fields for current parsing line is greater than the header, an error will be thrown.
  8. Trim the field for ASCII "white space"(\r\n\f\v) as well as some UTF-8 BOM characters(\xef\xbb\xbf) before separation.
  9. If no namespace is separated, will use empty namespace.
  10. Separator supports using \t to represent tabs. Otherwise, if assigning more than one character, an error will be thrown.
  11. Directly read the label as string, interpret it using the VW text label parser.
  12. Will try to judge if the feature values are float or string, if NaN, will consider it as a string. quoted numbers are always considered as strings.
  13. If the feature value is empty, will skip that feature.
  14. Reset the parser when EOF of a file is met (for possible multiple input file support).
  15. Support using --csv_ns_value to scale the namespace values by specifying the float ratio.
    e.g. --csv_ns_value=a:0.5,b:0.3,:8, which the namespace a has a ratio of 0.5, b of 0.3, empty namespace of 8, other namespaces of 1.
  16. If all the cells in a line is empty, then consider it as an empty line. CSV is not a good fit for the multiline format, as evidenced by the large number of empty fields. Multi-line format often means different lines have different schemas. However, I still leave the empty line support to make sure that it’s flexible and extendable enough. In this case, users can still express multi-line examples in CSV files, although it is not listed as supported. We still throw an error if the number of fields separated by the line doesn’t match previous, even all the fields are empty, as this usually means typos that users may not intend.

@rajan-chari
Copy link
Member

Looks very useful. Thanks!

@rajan-chari
Copy link
Member

Although not explicitly supported, I am wondering if having a multiline csv example in the test folder would be useful reference.

@HollowMan6 HollowMan6 force-pushed the csv-pr branch 2 times, most recently from 8786329 to d7e5e91 Compare August 5, 2022 04:42
@HollowMan6
Copy link
Contributor Author

Although not explicitly supported, I am wondering if having a multiline csv example in the test folder would be useful reference.

@rajan-chari Thanks for reviewing, I have added that in d7e5e91

@HollowMan6 HollowMan6 force-pushed the csv-pr branch 2 times, most recently from 12c7993 to 53d8632 Compare August 10, 2022 01:01
CMakeLists.txt Outdated Show resolved Hide resolved
CMakeLists.txt Outdated Show resolved Hide resolved
vowpalwabbit/csv_parser/src/parse_example_csv.cc Outdated Show resolved Hide resolved
Copy link
Collaborator

@olgavrou olgavrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

<3 the extensive test coverage

Signed-off-by: Hollow Man <hollowman@opensuse.org>
Signed-off-by: Hollow Man <hollowman@opensuse.org>
Signed-off-by: Hollow Man <hollowman@opensuse.org>
Signed-off-by: Hollow Man <hollowman@opensuse.org>
Signed-off-by: Hollow Man <hollowman@opensuse.org>
Signed-off-by: Hollow Man <hollowman@opensuse.org>
@HollowMan6
Copy link
Contributor Author

Just rebased the PR and it seems like that the CI error is not cause by this PR but recent commits in the master, as that passes for the previous run https://github.com/VowpalWabbit/vowpal_wabbit/actions/runs/2833443417

@jackgerrits
Copy link
Member

I agree, that break is not introduced by this PR. It snuck in in #4109.

@jackgerrits
Copy link
Member

Awesome stuff, merging now :)

@jackgerrits jackgerrits merged commit 54ee37f into VowpalWabbit:master Aug 11, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants