Enhance: Fix Data Quality with Outlier Handling and Improved Missing Value Treatment #207

Merged: 14 commits merged into main on Jul 29, 2024

MooooCat
Contributor

Description

This pull request improves how the Synthetic Data Generator (SDG) framework handles outliers and missing values, with the goal of better data quality. The key changes are:

  1. Introduction of OutlierTransformer: A new transformer class designed to handle outliers in the data by converting them to specified fill values. This class is equipped to manage outliers in both integer and float columns, replacing them with default fill values (0 for integers and 0.0 for floats).

  2. Enhancements to NonValueTransformer: The NonValueTransformer class now handles missing values in a DataFrame by column type. Missing values in numeric columns are filled with the specified numeric defaults (0 for integers, 0.0 for floats), and missing values in non-numeric columns are filled with a default string ('NAN_VALUE').

  3. Documentation Updates: Comprehensive docstrings have been added to both the OutlierTransformer and NonValueTransformer classes, providing clear descriptions of their functionalities, attributes, and methods.

  4. Manager Registration: The OutlierTransformer has been registered with the DataProcessorManager, ensuring it can be utilized within the SDG framework.

  5. Regex Inspector Parameter Update: The parameter of the Regex Inspector's fit method has been renamed from raw_data to input_raw_data for clarity and consistency.

  6. DiscreteTransformer Registration: Registration of the DiscreteTransformer is disabled for now.

  7. Test Cases for OutlierTransformer: Added test cases to validate the functionality of the OutlierTransformer, including handling of outliers in integer and float columns.
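
As a rough illustration of item 1, the sketch below replaces bad values in integer and float columns with the default fill values (0 and 0.0). The description above does not say how outliers are detected, so this sketch assumes an outlier is a value that cannot be parsed as the column's numeric type. The helper name `replace_unparseable` and the sample columns are hypothetical and are not taken from the SDG code base.

```python
import pandas as pd

INT_FILL = 0      # default integer fill value named in the description
FLOAT_FILL = 0.0  # default float fill value named in the description


def replace_unparseable(series: pd.Series, kind: str) -> pd.Series:
    """Hypothetical helper, not the SDG implementation.

    Assumption: an "outlier" is a value that cannot be parsed as a number.
    Such values (and values that were already missing) get the fill value.
    """
    # errors="coerce" turns anything that is not a valid number into NaN
    numeric = pd.to_numeric(series, errors="coerce")
    if kind == "int":
        return numeric.fillna(INT_FILL).astype("int64")
    return numeric.fillna(FLOAT_FILL).astype("float64")


# Sample data with one unparseable entry per column
df = pd.DataFrame({
    "age": ["31", "n/a", "45"],
    "score": ["0.5", "oops", "1.25"],
})

cleaned = pd.DataFrame({
    "age": replace_unparseable(df["age"], "int"),
    "score": replace_unparseable(df["score"], "float"),
})
print(cleaned)
```

A statistical rule (for example an interquartile-range cutoff) could be swapped in for the parse check if that is closer to what the transformer actually does; the fill step would stay the same.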

Motivation and Context

This change makes the SDG more robust and reliable when the input data contains outliers or missing values.

With the new OutlierTransformer and the enhanced NonValueTransformer, the generated synthetic data is of higher quality and suitable for a wider range of applications, including cases where the source data contains real-world anomalies.

How has this been tested?

The changes are covered by automated test cases. Specifically:

  • OutlierTransformer: Test cases were designed to validate the handling of outliers in integer and float columns, ensuring they are replaced with the correct fill values.
  • NonValueTransformer: Tests verify that missing values are filled with the appropriate default depending on whether the column is numeric or non-numeric.
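
The NonValueTransformer behaviour being tested can be sketched in plain pandas as follows. This is a minimal sketch under assumptions, not the SDG implementation: the function `fill_missing`, its parameter names, and the sample DataFrame are hypothetical. Only the default fill values (0, 0.0, 'NAN_VALUE') come from the description.

```python
import pandas as pd
from pandas.api.types import is_integer_dtype, is_numeric_dtype


def fill_missing(df, int_fill=0, float_fill=0.0, str_fill="NAN_VALUE"):
    """Hypothetical helper: fill missing values by column type."""
    out = df.copy()
    for col in out.columns:
        if is_integer_dtype(out[col]):
            # Integer columns (including pandas' nullable Int64)
            out[col] = out[col].fillna(int_fill)
        elif is_numeric_dtype(out[col]):
            # Remaining numeric columns are treated as floats
            out[col] = out[col].fillna(float_fill)
        else:
            # Non-numeric columns get the placeholder string
            out[col] = out[col].fillna(str_fill)
    return out


df = pd.DataFrame({
    # Nullable integer dtype, so the column can hold a missing value
    "count": pd.array([1, None, 3], dtype="Int64"),
    "ratio": [0.5, None, 2.0],
    "city": ["Oslo", None, "Lima"],
})

filled = fill_missing(df)

# Checks in the spirit of the tests described above
assert filled["count"].tolist() == [1, 0, 3]
assert filled["ratio"].tolist() == [0.5, 0.0, 2.0]
assert filled["city"].tolist() == ["Oslo", "NAN_VALUE", "Lima"]
```

The integer column uses the nullable `Int64` dtype because a plain NumPy integer column cannot hold missing values; pandas would otherwise store it as float.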

Types of changes

  • Maintenance (no change in code, maintain the project's CI, docs, etc.)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)

Checklist:

  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.

@MooooCat MooooCat merged commit 16825af into main Jul 29, 2024
12 checks passed
@MooooCat MooooCat deleted the screenshot-demo branch July 29, 2024 03:25