ML-GARDEN

ml-garden is a pipeline library that simplifies the creation and management of machine learning projects. It offers a high-level interface for defining and executing pipelines, allowing users to focus on their projects rather than on boilerplate. It currently supports XGBoost models for regression tasks on tabular data, with plans to support more models in the future. A pipeline is built from three key components: Pipeline Steps, predefined steps that pass information to each other through a data container; a Config File, which defines the pipeline's steps and their parameters; and a Data Container, which stores and transfers data and results throughout the pipeline.

Warning

Please be advised that this library is in the early stages of development and is not recommended for production use at this time. It was developed as part of a pro bono collaboration with the Open Collaboration Foundation (OCF) and remains a work in progress; both its implementation and its API may change without prior notice. Use the library at your own discretion and be aware of the associated risks.

Features

  • Intuitive and easy-to-use API for defining pipeline steps and configurations
  • Support for various data loading formats, including CSV and Parquet
  • Flexible data preprocessing steps, such as data cleaning, feature calculation, and encoding
  • Seamless integration with XGBoost for model training and prediction
  • Hyperparameter optimization using Optuna for fine-tuning models
  • Evaluation metrics calculation and reporting
  • Explainable AI (XAI) dashboard for model interpretability
  • Extensible architecture for adding custom pipeline steps

Installation

To install ml-garden, you need to have Python 3.9 or higher and Poetry installed. Follow these steps:

  1. Clone the repository:

    git clone https://github.com/tryolabs/ml-garden.git
  2. Navigate to the project directory:

    cd ml-garden
  3. Install the dependencies using Poetry:

    poetry install

    If you want to include optional dependencies, you can specify the extras:

    poetry install --extras "xgboost"

    or

    poetry install --extras "all_models"
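To confirm the installation, you can open a Python shell inside the Poetry environment (poetry run python) and import the entry point used throughout the Usage section below:

from ml_garden import Pipeline  # should import without errors if the installation succeeded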

Usage

Here's an example of how to use the library to run an XGBoost pipeline:

  1. Create a config.json file with the following content:
{
  "pipeline": {
    "name": "XGBoostTrainingPipeline",
    "description": "Training pipeline for XGBoost models.",
    "parameters": {
      "save_data_path": "ames_housing.pkl",
      "target": "SalePrice",
      "tracking": {
        "experiment": "ames_housing",
        "run": "baseline"
      }
    },
    "steps": [
      {
        "step_type": "GenerateStep",
        "parameters": {
          "train_path": "examples/ames_housing/data/train.csv",
          "predict_path": "examples/ames_housing/data/test.csv",
          "drop_columns": ["Id"]
        }
      },
      {
        "step_type": "TabularSplitStep",
        "parameters": {
          "train_percentage": 0.7,
          "validation_percentage": 0.2,
          "test_percentage": 0.1
        }
      },
      {
        "step_type": "CleanStep"
      },
      {
        "step_type": "EncodeStep"
      },
      {
        "step_type": "ModelStep",
        "parameters": {
          "model_class": "XGBoost"
        }
      },
      {
        "step_type": "CalculateMetricsStep"
      },
      {
        "step_type": "ExplainerDashboardStep",
        "parameters": {
          "enable_step": false
        }
      }
    ]
  }
}
  2. Run the pipeline in train mode using the following code:
import logging

from ml_garden import Pipeline

logging.basicConfig(level=logging.INFO)

data = Pipeline.from_json("config.json").train()
  3. Run the pipeline for inference using the following code:
data = Pipeline.from_json("config.json").predict()

You can also set the prediction data as a DataFrame:

data = Pipeline.from_json("config.json").predict(df)

This will use the DataFrame provided in code, so the predict_path file in the GenerateStep configuration parameters is not needed.
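For example, you can load the prediction data yourself with pandas (reusing the test CSV path referenced by predict_path in the config above) and pass it in directly:

import pandas as pd

from ml_garden import Pipeline

df = pd.read_csv("examples/ames_housing/data/test.csv")  # same file as predict_path above
data = Pipeline.from_json("config.json").predict(df)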

The library allows users to define custom steps for data generation, cleaning, and preprocessing, which can be seamlessly integrated into the pipeline.
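For illustration, a custom step could look something like the sketch below. The base class import path, the execute method name, and the data container attributes are assumptions made for this example and are not ml-garden's documented API; check the source of the built-in steps (e.g. CleanStep) for the real interface.

# Hypothetical sketch: base class path, method name, and data attributes are assumed.
from ml_garden.core.steps import PipelineStep  # assumed import path


class DropExpensiveHousesStep(PipelineStep):
    """Example custom cleaning step that drops rows above a target threshold."""

    def __init__(self, max_sale_price: float) -> None:
        self.max_sale_price = max_sale_price

    def execute(self, data):
        # 'data' stands in for the pipeline's data container; attribute names are assumed
        data.train = data.train[data.train["SalePrice"] <= self.max_sale_price]
        return data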

Optuna dashboard

For hyperparameter tuning runs, you can launch the Optuna Dashboard to check their status with this command:

optuna-dashboard sqlite:///db.sqlite3
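The sqlite:///db.sqlite3 URL is the Optuna storage backend holding the study results. For reference only (this is plain Optuna usage, not ml-garden's internal code, and the study name and direction are placeholders), a study backed by that storage is created like this:

import optuna

study = optuna.create_study(
    study_name="ames_housing",       # placeholder name
    storage="sqlite:///db.sqlite3",  # same storage URL passed to optuna-dashboard
    direction="minimize",            # placeholder objective direction
    load_if_exists=True,             # reattach to an existing study instead of failing
)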

MLFlow Experiment Tracking

You can host an MLFlow server locally to track your experiments by running:

mlflow server --host 0.0.0.0 --port 5000

If you're within Tryolabs' VPN, you can also use the MLFlow server hosted on our servers:

http://192.168.10.241:49420
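The README does not spell out how the config's tracking section connects to a given server, so treat the following as an assumption rather than documented behavior; with the standard MLflow client, pointing at a tracking server and experiment looks like this:

import mlflow

mlflow.set_tracking_uri("http://localhost:5000")  # or the internal URL above when on the VPN
mlflow.set_experiment("ames_housing")             # matches the "experiment" name in the config's tracking section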

Performance and Memory Profiling

We've added pyinstrument and memray as development dependencies for profiling the library's performance and memory usage. Refer to each tool's documentation for usage notes.
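For example, a training run can be profiled with pyinstrument as in the minimal sketch below (this is plain pyinstrument usage, not a workflow documented by this repository):

from pyinstrument import Profiler

from ml_garden import Pipeline

profiler = Profiler()
profiler.start()

Pipeline.from_json("config.json").train()  # the training run from the Usage section

profiler.stop()
print(profiler.output_text(unicode=True, color=True))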

Contributing

Contributions to ml-garden are welcome! If you encounter any issues, have suggestions for improvements, or want to add new features, please open an issue or submit a pull request on the GitHub repository.
