In ancient Roman mythology, the god Mercury was known as the messenger of the gods. Wearing winged shoes and a winged hat, he zipped between Mount Olympus and the kingdoms of men and saw to it that the will of the gods was made known.
We've chosen `mercury-ml` as the name of this package because we see its role as very similar.
Recent developments in Machine Learning and Data Processing tools have led to a myriad of excellent open source libraries, each of which provides well-developed and transparent APIs. Things become more complicated when functions from different libraries need to be strung together to form a machine learning workflow. `mercury-ml` is "a messenger of the gods" that enables this to happen. It seeks to break a Machine Learning project down into its typical generic components (such as `read data`, `transform data`, `fit model`, `evaluate model`, etc.) and offers a generic, modular structure into which implementations for specific methods and technologies can slot.
These components can then be chained together into a coherent, easily configurable workflow for fitting, evaluating and (coming soon!) serving Machine Learning models.
The package is split firstly into two broad sections:

- `common`: where functions and classes that are commonly useful, regardless of the machine learning engine used, are found.
- Engine-specific sections: currently this includes `h2o` and `keras`.
Within these sections there is a further subdivision into three APIs:

- `providers`: the individual (modular) building blocks used to build up a workflow.
- `containers`: IoC containers from which the desired providers can be fetched.
- `tasks`: small predefined chunks of work, mostly stringing together a handful of logical steps to be executed by various providers.
You can interact with `mercury-ml` via any or all of these APIs. They deliver different levels of abstraction, depending on what you need. You can also easily mix in your own custom providers.
To understand the purpose and function of each individual provider, container and task, please refer to the `mercury-ml` API documentation.
Since `mercury-ml` functions as a facilitator for workflows based on various different packages, its dependencies are determined by which functions are used. The core dependencies have been minimized to only a handful of packages.
Core dependencies:

- `python>=3.5`
- `pandas`
- `numpy`
- `scikit-learn`

Optional dependencies, depending on which providers you use:

- Keras: `tensorflow` or `tensorflow-gpu`, `keras`, `Pillow`
- H2O: `h2o`
- H2O Sparkling Water: `pyspark`, `h2o-pysparkling`
- AWS S3: `boto3`
- GCP Cloud Storage: `google-cloud-storage`
`mercury-ml` can be installed from PyPI with `pip install mercury-ml`. This will also install the core dependencies. To additionally install extra dependencies, you can use the following hints:

- `pip install mercury-ml[keras]` will install [`tensorflow`, `keras`, `pillow`]
- `pip install mercury-ml[keras-gpu]` will install [`tensorflow-gpu`, `keras`, `pillow`]
- `pip install mercury-ml[h2o]` will install [`h2o`]
- `pip install mercury-ml[h2o-sparkling]` will install [`h2o`, `pyspark`, `h2o-pysparkling`]
- `pip install mercury-ml[s3]` will install [`boto3`]
- `pip install mercury-ml[gcs]` will install [`google-cloud-storage`]
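If you need more than one set of extras, pip's standard comma syntax should also work, e.g. `pip install mercury-ml[keras,s3]` (assuming the extras are defined as listed above).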
`mercury-ml` aims to offer simplified access to functionality at different levels of abstraction.
Below are four simple examples that each do the same thing: save a Keras model to S3. They do so at different levels of abstraction:
- Without using `mercury-ml` (i.e. directly using the underlying dependencies)
- Using the `providers` API
- Using the `containers` API
- Using the `tasks` API (in conjunction with the `containers` API)
Each of these examples is perfectly valid, though in certain circumstances one may make more sense than another.
For more complete examples, please see the `examples` directory in this repository.
Let's assume we have the following inputs:
model = ...  # assume a fitted Keras model is available here
filename = "my_model"
local_dir = "./local_models"
extension = ".h5"
remote_dir = "my-bucket/remote-model"
Using the underlying libraries directly, rather than the `mercury-ml` APIs, makes sense when you want maximum flexibility to configure how these libraries are used.
import os
import boto3

# save model locally
if not os.path.exists(local_dir):
    os.makedirs(local_dir)
filename_with_ext = filename + extension
local_path = os.path.join(local_dir, filename_with_ext)
model.save(local_path)

# copy to s3
session = boto3.Session()
s3 = session.resource("s3")
s3_bucket, s3_partial_key = remote_dir.split("/", 1)
s3_key = s3_partial_key + "/" + filename_with_ext
s3.Object(s3_bucket, s3_key).put(Body=open(local_path, "rb"))
Using the `providers` API makes the most sense if you want to hardcode the providers you want to use. For example, the code snippet below can only use `model_saving.save_keras_hdf5` and `from_disk.copy_from_disk_to_s3`. If you want to save the model in a different format, or copy it to a different store, you must change your code to do so.
from mercury_ml.keras.providers import model_saving
from mercury_ml.common.providers.artifact_copying import from_disk
import os
# save model
# save model
path = model_saving.save_keras_hdf5(model=model,
                                    filename=filename,
                                    local_dir=local_dir,
                                    extension=extension)

# copy to s3
from_disk.copy_from_disk_to_s3(source_dir=local_dir,
                               target_dir=remote_dir,
                               filename=os.path.basename(path))
Using the `containers` API makes the most sense when you want to steer your workflow via a configuration file. The containers are just lightweight classes that allow you to access various similar providers from a single location. For example, the function used above, `model_saving.save_keras_hdf5`, can be accessed via a container as `ModelSavers.hdf5`. Using the `getattr` function it can also be accessed as `getattr(ModelSavers, "hdf5")`, allowing us to easily parameterize this in a config.
from mercury_ml.keras.containers import ModelSavers
from mercury_ml.common.containers import ArtifactCopiers
import os
config = {
    "save_model": "hdf5",
    "copy_model": "copy_from_disk_to_s3"
}

save_model = getattr(ModelSavers, config["save_model"])
copy_from_local_to_remote = getattr(ArtifactCopiers, config["copy_model"])

# save model
path = save_model(model=model,
                  filename=filename,
                  local_dir=local_dir,
                  extension=extension)

# copy to s3
copy_from_local_to_remote(source_dir=local_dir,
                          target_dir=remote_dir,
                          filename=os.path.basename(path))
Using the `tasks` API makes sense when you want a single function that defines a small workflow involving more than one provider and requiring multiple steps. For example, the `store_model` task below is injected with a `save_model` and a `copy_from_local_to_remote` provider and uses them first to save a model locally and then to copy it to a remote location (in this example, to S3).
from mercury_ml.common.tasks import store_model
from mercury_ml.keras.containers import ModelSavers
from mercury_ml.common.containers import ArtifactCopiers

# config as defined in the containers example above
save_model = getattr(ModelSavers, config["save_model"])
copy_from_local_to_remote = getattr(ArtifactCopiers, config["copy_model"])

# save model and copy to s3
store_model(save_model=save_model,
            copy_from_local_to_remote=copy_from_local_to_remote,
            model=model,
            filename=filename,
            local_dir=local_dir,
            remote_dir=remote_dir,
            extension=extension)
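As mentioned earlier, you can also mix in your own custom providers. The sketch below is a hypothetical illustration: the `save_keras_model_json` function is invented here, and it assumes a custom saver only needs to follow the same signature as the built-in savers shown above and return the path it wrote to.

import os

# hypothetical custom saver: same signature as the built-in savers above,
# returns the path that was written
def save_keras_model_json(model, filename, local_dir, extension=".json"):
    if not os.path.exists(local_dir):
        os.makedirs(local_dir)
    path = os.path.join(local_dir, filename + extension)
    with open(path, "w") as file:
        file.write(model.to_json())  # Keras: writes the model architecture as JSON
    return path

# save model (architecture only) and copy to s3 using the custom saver
store_model(save_model=save_keras_model_json,
            copy_from_local_to_remote=copy_from_local_to_remote,
            model=model,
            filename=filename,
            local_dir=local_dir,
            remote_dir=remote_dir,
            extension=".json")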
It is worth saying a few words about how `mercury-ml` deals with data, as this forms the backbone of how it facilitates robust machine learning workflows. There are three concepts to understand:
- `DataWrapper`. An instance of the `DataWrapper` class wraps an underlying data structure (for example a `Pandas DataFrame`, `Spark DataFrame`, `Numpy Array` or `Keras ImageDataGenerator`) into an object that has the following characteristics:
  - It has the attributes `underlying` (which gives you direct access to the data structure that has been wrapped) and `field_names`, which is a list with the names of the fields in the underlying data (`field_names` is not always relevant, and may be set to `None`).
  - It has various functions that transform one `DataWrapper` into another. For example, `PandasDataWrapper.to_numpy()` will yield an instance of `NumpyDataWrapper` (see the short sketch after this list).
- `DataSet`. An instance of the `DataSet` class is a container for various instances of `DataWrapper`. A `DataSet` will typically consist of `DataWrappers` for `full_data`, `index`, `features` and `targets`, but this is not predefined. It also contains some functionality that facilitates the transformation into a `DataSet` with `DataWrappers` of a different type.
- `DataBunch`. An instance of the `DataBunch` class is essentially just a container that holds one or more `DataSet` instances. A `DataBunch` will typically consist of `train`, `valid` and `test` `DataSets`.
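The `underlying` and `field_names` attributes and the transformation functions can be illustrated with a small standalone wrapper. This is a minimal sketch: `PandasDataWrapper.to_numpy()` and the attribute names are as described above, while the tiny DataFrame is just made up for illustration.

import pandas as pd
from mercury_ml.common.providers.data_wrappers.pandas import PandasDataWrapper

small_df = pd.DataFrame({"x1": [1.0, 2.0], "x2": [3.0, 4.0]})
wrapper = PandasDataWrapper(underlying=small_df, field_names=["x1", "x2"])

print(wrapper.field_names)             # ["x1", "x2"]
print(type(wrapper.underlying))        # the wrapped pandas DataFrame

numpy_wrapper = wrapper.to_numpy()     # yields a NumpyDataWrapper
print(type(numpy_wrapper.underlying))  # expected to be a numpy array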
As an example of how this works, let's create a `DataBunch` for training a model that uses Pandas DataFrames as inputs:
import pandas as pd
from mercury_ml.common.providers.data_wrappers.pandas import PandasDataWrapper
from mercury_ml.common.providers.data_set import DataSet
from mercury_ml.common.providers.data_bunch import DataBunch
path_to_input_data = "./example_data_train.csv"
full_data_columns = ["ID", "field1_num", "field2_num", "field3_factor", "field4_target"]  # the full set of columns relevant to training
index_columns = ["ID"]  # the columns that make up the unique index
features_columns = ["field1_num", "field2_num", "field3_factor"]  # the columns to be used as features when training
targets_columns = ["field4_target"]  # the columns with the targets to be trained on

df = pd.read_csv(path_to_input_data, usecols=full_data_columns)

train_data_set = DataSet(data_wrappers_dict={
    "full_data": PandasDataWrapper(underlying=df, field_names=full_data_columns),
    "index": PandasDataWrapper(underlying=df[index_columns], field_names=index_columns),
    "features": PandasDataWrapper(underlying=df[features_columns], field_names=features_columns),
    "targets": PandasDataWrapper(underlying=df[targets_columns], field_names=targets_columns)
})

data_bunch = DataBunch(data_sets_dict={
    "train": train_data_set
})
You could also add additional `DataSets` to the `DataBunch`, either when initially constructing it:
# assuming valid_data_set and test_data_set were built in the same way as train_data_set
data_bunch = DataBunch(data_sets_dict={
    "train": train_data_set,
    "valid": valid_data_set,
    "test": test_data_set
})
Or afterwards via `DataBunch.add_data_set`:
from mercury_ml.common.providers.data_bunch import DataBunch
data_bunch = DataBunch()
data_bunch.add_data_set(data_set_name="train", data_set=train_data_set)
data_bunch.add_data_set(data_set_name="valid", data_set=valid_data_set)
data_bunch.add_data_set(data_set_name="test", data_set=test_data_set)