Better way to ignore columns when running a report #548

Open
npatki opened this issue Apr 2, 2024 · 1 comment
Labels
feature request Request for a new feature

Comments

@npatki
Contributor

npatki commented Apr 2, 2024

Problem Description

As described in #546, I may want to ignore certain columns in a dataset when running a report (quality or diagnostic). It is not completely intuitive how to do this.

  1. The metadata requires that all columns be described, so you cannot ask a report to ignore a column simply by removing it from the metadata.
  2. It is unclear from the metadata spec which columns will be ignored and which will be used for evaluation.

Actual Solution: If you mark a column with an "other" sdtype (anything that is not categorical, numerical, datetime, etc.), SDV assumes it is non-statistical PII and therefore ignores the column. For example, setting the sdtype to 'text' is sufficient to get a report to ignore the column.
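A minimal sketch of this workaround, operating on the metadata as a plain dictionary. It assumes a single-table layout of the form `{"columns": {name: {"sdtype": ...}}}`; the helper name `mark_columns_as_ignored` and the column names are hypothetical and not part of SDV.

```python
import copy


def mark_columns_as_ignored(metadata_dict, columns_to_ignore):
    """Return a copy of the metadata with the given columns set to sdtype 'text'.

    Hypothetical helper (not part of SDV). The input dict is left unchanged.
    """
    updated = copy.deepcopy(metadata_dict)
    for name in columns_to_ignore:
        if name not in updated["columns"]:
            raise KeyError(f"Column {name!r} is not described in the metadata")
        # Replace the whole entry so keys tied to the old sdtype do not linger.
        updated["columns"][name] = {"sdtype": "text"}
    return updated


# Made-up example metadata, assuming the dictionary layout described above.
metadata_dict = {
    "columns": {
        "applicant_id": {"sdtype": "id"},
        "income": {"sdtype": "numerical"},
        "free_form_notes": {"sdtype": "categorical"},
    }
}

# Copy of the metadata intended only for the report.
report_metadata = mark_columns_as_ignored(metadata_dict, ["free_form_notes"])
```

Working on a copy keeps the original metadata intact for other uses, so the 'text' override only affects what the report sees.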

Expected behavior

The metadata spec should probably remain as-is, because in the future we may decide to add metrics for specific sdtypes.

However, perhaps the report itself should allow you to specify which columns to ignore?

@npatki npatki added the feature request Request for a new feature label Apr 2, 2024
@srinify

srinify commented Aug 16, 2024

Another use case: the visualization phase after a Quality Report is generated.

If a table has a large number of columns, the generated visualizations become hard to interact with and use for insight gathering. This is an example from the loan_applications dataset:

Screenshot 2024-08-16 at 10 59 00 AM

If I want to focus on ~10 columns in the Quality Report, there is no easy way to do this natively. Potential solutions could take either of two forms:

  • ignoring columns in Quality Report
  • or ignoring columns in viz generation after full Quality Report is generated
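As a rough illustration of the second option, the per-column scores could be narrowed down after the full report has run. This sketch assumes the scores are available as rows with `Column`, `Metric`, and `Score` fields, represented here as plain dictionaries; the helper `keep_columns`, the column names, and the score values are all made up for illustration.

```python
def keep_columns(details_rows, columns_to_keep):
    """Return only the per-column score rows for the requested columns.

    Hypothetical helper: each row is assumed to be a dict with a 'Column' key.
    """
    wanted = set(columns_to_keep)
    return [row for row in details_rows if row["Column"] in wanted]


# Made-up score rows standing in for the report's per-column details.
details = [
    {"Column": "col_a", "Metric": "KSComplement", "Score": 0.91},
    {"Column": "col_b", "Metric": "TVComplement", "Score": 0.84},
    {"Column": "col_c", "Metric": "KSComplement", "Score": 0.77},
]

# Keep only the columns of interest before plotting or inspecting them.
focused = keep_columns(details, ["col_a", "col_c"])
```

The report still evaluates every column in this variant; only the subset passed on to the visualization step shrinks.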
