Filters

char_length

Removes lines whose character length falls outside the range [min, max]

  • min (int) : Minimum length (inclusive)
  • max (int) : Maximum length (inclusive)

characters_count_mismatch

Removes lines when the total count of certain characters differs between source and target.

  • chars (str) : Characters to check (()[]?!:."“”{})
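
As an illustration only, the comparison could look like the sketch below. It assumes the filter returns True for pairs to remove and compares the combined total over all listed characters rather than each character separately; the function body is a guess at the behaviour described above, not the project's implementation.

```python
def characters_count_mismatch(source: str, target: str, chars: str = '()[]?!:."“”{}') -> bool:
    """Return True when the pair should be removed (assumed convention)."""
    def total(text: str) -> int:
        # Combined number of occurrences of every character in `chars`.
        return sum(text.count(c) for c in chars)

    return total(source) != total(target)


print(characters_count_mismatch("Is it (ok)?", "Est-ce ok ?"))  # True: 3 vs 1
print(characters_count_mismatch("Hello!", "Bonjour ?"))         # False: 1 vs 1
```

Because only the totals are compared, "!" on one side and "?" on the other still count as a match in this sketch.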

contains

Removes lines that contain these words

  • words (list(str)) : List of words
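
A rough sketch, under several assumptions that the description above does not settle: matching is a case-sensitive substring test, either side of the pair can trigger removal, and the function returns True for pairs to remove.

```python
from typing import List


def contains(source: str, target: str, words: List[str]) -> bool:
    """Return True when the pair should be removed (assumed convention)."""
    # Assumption: plain substring match on either side, case-sensitive.
    return any(word in source or word in target for word in words)


print(contains("Click here to subscribe", "Cliquez ici", ["subscribe"]))  # True
print(contains("Hello", "Bonjour", ["subscribe"]))                        # False
```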

digits_mismatch

Removes lines when there are digits in source and not in target, or vice-versa
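
One way to read this rule, sketched under the assumption that a filter returns True for pairs to remove: only the presence of digits is compared, not their values.

```python
def digits_mismatch(source: str, target: str) -> bool:
    """Return True when exactly one side contains a digit (assumed convention)."""
    def has_digit(text: str) -> bool:
        return any(ch.isdigit() for ch in text)

    return has_digit(source) != has_digit(target)


print(digits_mismatch("Room 42", "Chambre quarante-deux"))  # True
print(digits_mismatch("3 cats", "7 chats"))                 # False: both have digits
```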

digits_ratio

Removes lines when the ratio of numerical characters to the total length of the line is greater than max.

  • max (float) : Maximum ratio (0.4)
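
The ratio test can be sketched as follows. This is an assumption-laden illustration: it treats 0.4 as the default, checks source and target independently, returns True for pairs to remove, and uses the hypothetical parameter name `max_ratio` for max.

```python
def digits_ratio(source: str, target: str, max_ratio: float = 0.4) -> bool:
    """Return True when either side is too digit-heavy (assumed convention)."""
    def ratio(text: str) -> float:
        # Guard against empty lines to avoid dividing by zero.
        return sum(ch.isdigit() for ch in text) / len(text) if text else 0.0

    return ratio(source) > max_ratio or ratio(target) > max_ratio


print(digits_ratio("12345 ab", "douze"))     # True: 5/8 = 0.625
print(digits_ratio("Call 911", "Appel 911")) # False: 3/8 and 3/9 are below 0.4
```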

duplicates

Removes lines when source is the same as target

excerpt

Selects the part of the dataset located between the top and bottom percentiles (useful with very large datasets).

  • top_percentile (float) : dataset percentile where data collection begins
  • bottom_percentile (float) : percentile where data collection ends
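
A sketch of the slicing this implies, assuming percentiles are given on a 0 to 100 scale and measured from the start of the dataset; the real option may use a different scale. Unlike the per-line filters, this operates on the whole list of lines.

```python
from typing import List, Sequence


def excerpt(lines: Sequence[str], top_percentile: float, bottom_percentile: float) -> List[str]:
    """Keep the lines between two percentile positions (0-100 scale assumed)."""
    start = int(len(lines) * top_percentile / 100)
    end = int(len(lines) * bottom_percentile / 100)
    return list(lines[start:end])


lines = [f"line {i}" for i in range(10)]
print(excerpt(lines, 20, 50))  # ['line 2', 'line 3', 'line 4']
```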

first_char_mismatch

Removes lines when the first character is a letter but the case is mismatched, or the first character in source is not the same as the first character in target.
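
The description leaves some room for interpretation. The sketch below picks one reading: when both first characters are letters, only their case is compared; otherwise the two characters must be identical. It assumes the filter returns True for pairs to remove and leaves empty lines alone.

```python
def first_char_mismatch(source: str, target: str) -> bool:
    """Return True when the first characters disagree (assumed convention)."""
    if not source or not target:
        return False  # assumption: empty lines are left to other filters
    s, t = source[0], target[0]
    if s.isalpha() and t.isalpha():
        # Letters may differ across languages, so compare case only.
        return s.isupper() != t.isupper()
    # Digits, punctuation, or a letter facing a non-letter: require equality.
    return s != t


print(first_char_mismatch("Hello", "bonjour"))    # True: case differs
print(first_char_mismatch("- item", "- élément")) # False: same leading "-"
```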

nonalphanum_count_mismatch

Removes lines when the total count of non-alphanumeric characters (except spaces) differs between source and target
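
A possible implementation, for illustration only. It assumes the filter returns True for pairs to remove and treats every whitespace character (not just the space character) as excluded from the count.

```python
def nonalphanum_count_mismatch(source: str, target: str) -> bool:
    """Return True when punctuation/symbol counts differ (assumed convention)."""
    def count(text: str) -> int:
        # Letters and digits in any script count as alphanumeric via str.isalnum().
        return sum(1 for ch in text if not ch.isalnum() and not ch.isspace())

    return count(source) != count(target)


print(nonalphanum_count_mismatch("Hello, world!", "Bonjour le monde"))  # True: 2 vs 0
print(nonalphanum_count_mismatch("Hi!", "Salut!"))                      # False: 1 vs 1
```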

nonalphanum_ratio

Removes lines when the ratio of non-alphanumeric characters to the total length of the line is greater than max.

  • max (float) : Maximum ratio (0.4)

source_target_ratio

Removes lines when the ratio (len(source) / len(target)) is outside of bounds

  • min (float) : Lower bound (inclusive)
  • max (float) : Upper bound (inclusive)
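
The bounds check might look like this sketch. It assumes the filter returns True for pairs to remove, and it makes its own choice for an empty target (the ratio is undefined, so the pair is removed); `min_ratio` and `max_ratio` are hypothetical names for min and max.

```python
def source_target_ratio(source: str, target: str, min_ratio: float, max_ratio: float) -> bool:
    """Return True when len(source) / len(target) is out of bounds (assumed convention)."""
    if not target:
        return True  # assumption: an undefined ratio counts as out of bounds
    ratio = len(source) / len(target)
    # Both bounds are inclusive, as stated in the parameter list.
    return not (min_ratio <= ratio <= max_ratio)


print(source_target_ratio("abcd", "ab", 0.5, 2.0))    # False: ratio is exactly 2.0
print(source_target_ratio("abcdef", "ab", 0.5, 2.0))  # True: ratio is 3.0
```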

top

Adds only the top X% of lines from the dataset

  • percent (float) : Percentage of dataset to include

uppercase_count_mismatch

Removes lines when source and target have a different number of uppercase letters
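
A short sketch of the comparison, assuming the filter returns True for pairs to remove and that "uppercase" means whatever Python's `str.isupper()` reports for a single character.

```python
def uppercase_count_mismatch(source: str, target: str) -> bool:
    """Return True when the uppercase letter counts differ (assumed convention)."""
    def count(text: str) -> int:
        return sum(ch.isupper() for ch in text)

    return count(source) != count(target)


print(uppercase_count_mismatch("NASA launched", "La NASA a lancé"))  # True: 4 vs 5
print(uppercase_count_mismatch("Hello World", "Bonjour Monde"))      # False: 2 vs 2
```

As the first example shows, a rule this strict also rejects valid translations in which the target language capitalizes a sentence-initial article.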