Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

README fixes. #122

Merged
merged 1 commit into from
May 8, 2020
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
31 changes: 17 additions & 14 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -11,35 +11,38 @@ ArangoML Pipeline is a common and extensible Metadata Layer for Machine Learning
</center>

**News:**
[ArangoML Pipeline Cloud](https://www.arangodb.com/2020/01/arangoml-pipeline-cloud-manage-machine-learning-metadata/) is offering a no-setup, free-to-try managed service for ArangpML Pipeline. A [ArangoML Pipeline Cloud tutorial](https://colab.research.google.com/github/arangoml/arangopipe/blob/master/examples/Arangopipe_with_TensorFlow_Beginner_Guide.ipynb#) is also available without any installation or Signup.
[ArangoML Pipeline Cloud](https://www.arangodb.com/2020/01/arangoml-pipeline-cloud-manage-machine-learning-metadata/) is offering a no-setup, free-to-try managed service for ArangpML Pipeline. A [ArangoML Pipeline Cloud tutorial](https://colab.research.google.com/github/arangoml/arangopipe/blob/master/examples/Arangopipe_with_TensorFlow_Beginner_Guide.ipynb#) is also available without any installation or signup.

## Quick Start
To get started with no installations of any sort (using ArangoML Pipeline Cloud)
To get started without any installations (using ArangoML Pipeline Cloud)
, click :
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/arangoml/arangopipe/blob/master/examples/Arangopipe_Feature_Examples.ipynb)

The [examples folder](https://github.com/arangoml/arangopipe/tree/master/examples) contains examples more example notebooks that illustrate the features of **Arangopipe**.
The [examples folder](https://github.com/arangoml/arangopipe/tree/master/examples) contains notebooks that illustrate the features of **Arangopipe**.

## Overview
When machine learning pipelines are created, for example using [TensorFlow Extended](https://www.tensorflow.org/tfx/guide) or [Kubeflow](https://www.kubeflow.org/),
the capture (and access to) of metadata across the pipeline is vital. Typically, each component of such an ML pipeline produces or requires metadata, for example:

## Introduction
When productizing Machine Learning Pipelines (e.g., [TensorFlow Extended](https://www.tensorflow.org/tfx/guide) or [Kubeflow](https://www.kubeflow.org/))
the capture (and access to) of metadata across the pipeline is vital. Typically, each of the components of such ML pipeline produces/requires Metadata, for example:
* Data storage: size, location, creation date, checksum, ...
* Feature Store (processed dataset): transformation, version, base datasets ...
* Model Training: training/validation performance, training duration, ...
* Model Serving: model linage, serving performance, ...

Instead of each component storing its own metadata, a common Metadata Layer allows for queries across the entire pipeline and more efficient management.
[**ArangoDB**](https://www.arangodb.com) being a multi model database supporting both efficient document and graph data models within a single database engine is a great fit for such kind of common metadata layer for the following reasons:
Instead of each component storing its metadata, a common metadata layer simplifies data management and permits querying the entire pipeline.
[**ArangoDB**](https://www.arangodb.com), being a multi model database, supporting both efficient document and graph data models within a single database engine, is a great fit for such a metadata layer, for the following reasons:

* The metadata produced by each component is typically unstructured (e.g., TensorFlow's training metadata is different from PyTorch's metadata) and hence a great fit for document databases
* The relationship between the different entities (i.e., metadata) can be neatly expressed as graphs (e.g., this model has been trained by *run_34* on *dataset_y*)
* Querying the metadata can be easily expressed as a graph traversal (e.g., all models which have been derived from *dataset_y*)
* Metadata queries are easily expressed as graph traversals (e.g., all models which have been derived from *dataset_y*)

## Use Cases
ArangoML Pipeline benefits many different scenarios including:
* Capture of Lineage Information (e.g., Which dataset influences which Model?)
* Capture of Audit Information (e.g, A given model was training two months ago with the following training/validation performance)
* Reproducible Model Training
* Model Serving Policy (e.g., Which model should be deployed in production based on training statistics)
ArangoML Pipeline can benefit many scenarios, such as:

* Capture of lineage information (e.g., Which dataset influences which model?)
* Capture of audit information (e.g, A given model was training two months ago with the following training/validation performance)
* Reproducible model training
* Model serving policy (e.g., Which model should be deployed in production based on training statistics)
* Extension of existing ML pipelines through simple python/HTTP API

## Documentation
Expand Down