🗺️ Public Roadmap

  • Detection of problematic data slices
  • Basic explanation of found issues via feature importances
  • Limited embedding computation for images, audio, text
  • Extended embedding support, e.g., more embedding models and support for precomputed embeddings
  • Speed up embedding computation using the datasets library
  • Improved issue detection algorithm that avoids duplicate detections of similar problems and keeps outliers from influencing segment detection
  • Support application on datasets without labels (outlier-based)
  • Adaptive drop reference for datasets that contain a wide variety of data
  • Large data support for detection and reporting, e.g., 500k audio samples with transcriptions
  • Alternative interfaces to min_drop and min_support. Maybe n_slices and sorting by a criterion?
  • Support application without a model (by training a simple baseline model)
  • Improve normalization for mixed-type runs, e.g., embedding plus one categorical or numeric variable
  • Walkthroughs for unstructured, structured and mixed data. Also, an in-depth tutorial explaining all the parameters.
  • Soft dependencies for embedding computation and AutoML, as the torch and xgboost dependencies are large
  • Per-use-case helpers such as find_issues_object_detection, find_issues_ts_forecasting, ...
  • Allow for model comparisons via intersection, difference, ...
  • Allow application of sliceguard on time series
  • Add a Sliceguard deep-dive notebook to show more advanced usage
  • Build Sphinx docs
  • Stronger automated testing
  • Make the outlier detection algorithm more robust, probably through better parameter choices
  • Interpretable features for images, audio, text. E.g., dark image, quiet audio, long audio, contains common word x, ...
  • Generation of a summary report doing predefined checks
  • "Supervised" clustering that incorporates classes, probabilities, metrics, not only features
  • Data connectors for faster application on common data formats
  • Support embedding generation for remote resources, e.g., audio/images hosted on web servers
  • Improved explanations for found issues, e.g., via SHAP
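
To illustrate the item on alternative interfaces to min_drop and min_support, here is a minimal sketch of how threshold-based filtering might compare with picking n_slices sorted by a criterion. This is not sliceguard's API: the `Slice` dataclass, both function names, the `"impact"` criterion and the example numbers are hypothetical and only meant to make the idea concrete.

```python
from dataclasses import dataclass


@dataclass
class Slice:
    # Hypothetical container for a detected data slice (not a sliceguard type).
    name: str
    support: int   # number of samples in the slice
    drop: float    # metric drop relative to the reference


def filter_by_thresholds(slices, min_drop, min_support):
    """Threshold-style interface: keep slices that pass both cutoffs."""
    return [s for s in slices if s.drop >= min_drop and s.support >= min_support]


def top_n_slices(slices, n_slices, criterion="drop"):
    """Ranking-style interface: return the n_slices best slices by a criterion."""
    keys = {
        "drop": lambda s: s.drop,
        "support": lambda s: s.support,
        # Assumed combined criterion: drop weighted by slice size.
        "impact": lambda s: s.drop * s.support,
    }
    return sorted(slices, key=keys[criterion], reverse=True)[:n_slices]


# Made-up candidates purely for illustration.
candidates = [
    Slice("a", support=500, drop=0.05),
    Slice("b", support=20, drop=0.40),
    Slice("c", support=120, drop=0.15),
]

thresholded = filter_by_thresholds(candidates, min_drop=0.1, min_support=50)
top_by_impact = top_n_slices(candidates, n_slices=2, criterion="impact")
```

With thresholds, the user has to guess sensible cutoffs up front and may get no results or too many. With a ranking interface, the number of returned slices is fixed and the choice of criterion decides whether large, mildly degraded slices or small, severely degraded ones come first.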