PyD2M is a dependency management tool for pandas DataFrames. In brief, it does two things:
- Automatically extract and join data fields from different data files (in different formats: csv, msgpack, etc.)
- Generate new data and save them to files according to some pre-defined recipes
Here, we use PyD2M to manage a dataset used for the simulation of a transshipment container port. In a container port, a transshipment container is discharged from its first vessel, stored somewhere in the terminal's storage yard for a few days, and then loaded onto its second vessel.
Initially, suppose we have a vessel_info.csv file generated by some other tool, which contains a port's vessel visiting information for one month. There are three columns in this file: the vessel IDs, vessel lengths, and arrival times. Each vessel visits the port in a periodic pattern, so a VesselID may appear multiple times in the file. The ArrivalTime is the number of seconds since the beginning of the month.
VesselId,Length,ArrivalTime
7,240,7148
2,195,9907
0,195,32867
6,216,47490
8,319,49142
3,182,59537
...
To use PyD2M, let's create a directory named dataset and put vessel_info.csv under dataset/raw. Also, we create a sub-directory conf containing a file d2m.rc.
dataset
+-- conf
| +-- d2m.rc
+-- raw
| +-- vessel_info.csv
- Note: conf/d2m.rc is the default location for a PyD2M dataset's configuration file, which is in YAML format.
Now, add the following content to d2m.rc:
DATA:
  raw:
    vessel_info.csv:
      TYPE: csv
      DECLARE_NEW_FIELDS: True
      LOCAL_FIELDS_ONLY: False
      FREE_FIELDS: False
      FIELDS:
        - VesselID: str
        - Length: int
        - ArrivalTime: int
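Incidentally, if no real vessel_info.csv is at hand, a synthetic one matching these fields can be generated with plain pandas. This is only a sketch; the vessel count and value ranges below are made up:

```python
import os
import numpy as np
import pandas as pd

os.makedirs("dataset/raw", exist_ok=True)
rng = np.random.default_rng(0)
n = 50  # number of vessel arrivals in the month (arbitrary)
df = pd.DataFrame({
    "VesselID": rng.integers(0, 10, size=n).astype(str),
    "Length": rng.integers(100, 400, size=n),
    # seconds since the start of the month, sorted chronologically
    "ArrivalTime": np.sort(rng.integers(0, 30 * 24 * 3600, size=n)),
})
df.to_csv("dataset/raw/vessel_info.csv", index=False)
```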
Okay, the data is now ready to be managed by PyD2M!
To use PyD2M, first, we import the module and create a DataSource
object.
>>> from pyd2m.datasource import DataSource
>>> ds = DataSource("./dataset")
Then, we can use the DataSource.load
method to load the file as a
pandas DataFrame.
>>> ds.load("raw/vessel_info.csv")
VesselID Length ArrivalTime
0 4 344 7257
1 6 294 16757
2 8 339 18753
3 7 298 31737
4 0 318 44082
5 9 270 49759
...
Or, we can directly use the field names to load the data:
>>> ds["VesselID", "Length", "ArrivalTime"]
Base: raw/vessel_info.csv
VesselID Length ArrivalTime
0 4 344 7257
1 6 294 16757
2 8 339 18753
3 7 298 31737
4 0 318 44082
5 9 270 49759
...
Also, we can access only part of the data and change the order of the fields:
>>> ds["ArrivalTime", "Length"]
Base: raw/vessel_info.csv
ArrivalTime Length
0 7257 344
1 16757 294
2 18753 339
3 31737 298
4 44082 318
5 49759 270
...
However, the original data is not so convenient for data analytics. For example, we would like the ArrivalTime to be in the more human-readable datetime64 format instead of plain integers. Also, we would like to generate a unique VesselArrivalID for each vessel visit, defined as {MMDD}V{VesselID}.
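In plain pandas, that transformation can be sketched as follows, using the first sample row above (the 2019-01-01 origin is an arbitrary choice for the start of the month):

```python
import pandas as pd

# First sample row: VesselID 7 arrives 7148 seconds after the month starts
arrival = pd.to_timedelta(7148, unit="s") + pd.to_datetime("2019")
vessel_arrival_id = arrival.strftime("%m%d") + "V" + "7"
print(arrival)            # 2019-01-01 01:59:08
print(vessel_arrival_id)  # 0101V7
```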
This can be done by adding a hook to the file. First, we create a hook file named vessels.hk and put it into the conf sub-directory.
dataset
+-- conf
| +-- d2m.rc
| +-- vessels.hk
+-- raw
| +-- vessel_info.csv
- Note: The name of a hook file can be arbitrary, as long as its extension is .hk. There can be more than one .hk file in the conf directory.
Add the following content to the vessels.hk file.
from pyd2m import hooks
import pandas as pd

@hooks.load("raw/vessel_info.csv")
def vel_load_hook(df):
    # Convert seconds-since-month-start into absolute timestamps
    df.ArrivalTime = pd.to_timedelta(df.ArrivalTime, unit="s") + pd.to_datetime("2019")
    # Build the unique per-visit ID: {MMDD}V{VesselID}
    df["VesselArrivalID"] = df.ArrivalTime.dt.strftime("%m%d") + "V" + df.VesselID.astype(str)
    return df
Also, do not forget to change the type of the ArrivalTime field and to add the new field VesselArrivalID in the configuration file.
DATA:
  raw:
    vessel_info.csv:
      TYPE: csv
      DECLARE_NEW_FIELDS: True
      LOCAL_FIELDS_ONLY: False
      FREE_FIELDS: False
      FIELDS:
        - VesselID: str
        - Length: int
        - ArrivalTime: datetime64[s]
        - VesselArrivalID: str
Let's try loading this file/fields again.
>>> ds = DataSource("./dataset")
>>> ds["VesselArrivalID", "ArrivalTime", "Length"]
Base: raw/vessel_info.csv
VesselArrivalID ArrivalTime Length
0 0101V7 2019-01-01 04:13:07 101
1 0101V0 2019-01-01 06:03:26 335
2 0101V5 2019-01-01 06:27:19 150
3 0101V3 2019-01-01 12:34:50 314
4 0101V2 2019-01-01 14:19:19 396
5 0101V8 2019-01-01 14:38:38 140
You may have noticed that, in the vel_load_hook function, the ArrivalTime column of the transformed DataFrame actually has the type datetime64[ns]. However, since we have declared ArrivalTime to be datetime64[s] in the configuration, its type has already been auto-converted during loading!
Now, suppose we have another csv file, box_info.csv, which records the containers' information, including their BoxID, UnloadingVesselArrivalID and LoadingVesselArrivalID. Let's add this file's information to the configuration file as well.
DATA:
  raw:
    vessel_info.csv:
      ...
    box_info.csv:
      TYPE: csv
      DECLARE_NEW_FIELDS: True
      LOCAL_FIELDS_ONLY: False
      FREE_FIELDS: False
      FIELDS:
        - BoxID: str
        - UnloadingVesselArrivalID: str
        - LoadingVesselArrivalID: str
- Note: We can also define a DEFAULTS section in the configuration file and move the common attributes of different files into it, so that we don't need to declare them every time.
DEFAULTS:
  TYPE: csv
  DECLARE_NEW_FIELDS: True
  LOCAL_FIELDS_ONLY: False
  FREE_FIELDS: False
DATA:
  raw:
    vessel_info.csv:
      FIELDS:
        - VesselID: str
        - Length: int
        - ArrivalTime: datetime64[s]
        - VesselArrivalID: str
    box_info.csv:
      FIELDS:
        - BoxID: str
        - UnloadingVesselArrivalID: str
        - LoadingVesselArrivalID: str
When a vessel arrives at the container port, the berth planner decides when and where the vessel can be berthed. Here, we will not dive into complicated vessel berthing algorithms. Instead, let's write a simple random function to berth the vessels: make each vessel wait a random time of up to 2 hours after its arrival, and then put it at a random position along the linear quay of the port. After being berthed, each vessel has a random handling time between 4 and 12 hours.
First, let's define some constant values in the dataset's configuration file.
PARAMS:
  QUAY_LENGTH: 3000
  MAX_WAITING_TIME: 7200
After that, these constants can be accessed via ds.{var_name}.
The simple random strategy is implemented as follows.
>>> import numpy as np
>>> import pandas as pd
>>> df = ds["VesselArrivalID", "Length", "ArrivalTime"]
>>> df["MooringPosition"] = df.apply(lambda v: np.random.randint(0, ds.QUAY_LENGTH - v.Length), axis=1)
>>> df["MooringTime"] = df.ArrivalTime + pd.to_timedelta(np.random.random(size=len(df)) * ds.MAX_WAITING_TIME, unit="s")
>>> df["HandlingTime"] = pd.to_timedelta(np.random.uniform(4, 12, size=len(df)), unit="h")
Now, we can save the berthing plan into another file. Let's declare a file berthing.msg
in the configuration file. This time, we want it to be in the msgpack
format.
DATA:
  raw:
    ...
  plan:
    berthing.msg:
      TYPE: msgpack
      FIELDS:
        - VesselArrivalID: str
        - MooringPosition: int
        - MooringTime: datetime64[s]
        - HandlingTime: timedelta64[s]
Once the file is declared, use DataSource.dump to save the data.
>>> ds.dump("plan/berthing.msg", df)
- Tip: a hook can also be added to the dumping process. Just use @hooks.dump to decorate a function in any .hk file.
Once the file exists, we can use load
to load it again, or access its
fields directly.
>>> ds.load("plan/berthing.msg")
VesselArrivalID MooringPosition MooringTime HandlingTime
0 0101V8 816 2019-01-01 02:25:22 04:42:51
1 0101V7 243 2019-01-01 05:57:56 07:32:13
2 0101V0 630 2019-01-01 06:12:13 07:12:19
3 0101V9 140 2019-01-01 07:51:36 11:30:00
4 0101V6 570 2019-01-01 09:29:30 07:43:05
5 0101V3 588 2019-01-01 13:05:37 08:27:04
...
>>> ds["VesselArrivalID", "MooringTime"]
Base: plan/berthing.msg
VesselArrivalID MooringTime
0 0101V8 2019-01-01 02:25:22
1 0101V7 2019-01-01 05:57:56
2 0101V0 2019-01-01 06:12:13
3 0101V9 2019-01-01 07:51:36
4 0101V6 2019-01-01 09:29:30
5 0101V3 2019-01-01 13:05:37
...
Now, what if we want to analyse the relationship between vessels' lengths and their handling times? To do so, we need the values of both Length and HandlingTime. However, they are in different files with different formats! Do we need to load these two files separately and join them manually? Of course not. Instead, we can retrieve these fields directly.
>>> ds["VesselArrivalID", "Length", "HandlingTime"]
Base: raw/vessel_info.csv
Joining: plan/berthing.msg
VesselArrivalID Length HandlingTime
0 0101V8 345 09:12:38
1 0101V7 217 09:47:53
2 0101V0 293 08:55:18
3 0101V9 177 05:42:23
4 0101V6 115 10:06:20
5 0101V3 375 10:11:21
...
See? PyD2M has done this joining automatically!
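For comparison, doing this by hand would mean loading both files and merging them on the shared key. A minimal plain-pandas sketch with made-up rows:

```python
import pandas as pd

vessels = pd.DataFrame({
    "VesselArrivalID": ["0101V8", "0101V7"],
    "Length": [345, 217],
})
berthing = pd.DataFrame({
    "VesselArrivalID": ["0101V7", "0101V8"],
    "HandlingTime": pd.to_timedelta(["09:47:53", "09:12:38"]),
})
# The manual equivalent of ds["VesselArrivalID", "Length", "HandlingTime"]
joined = vessels.merge(berthing, on="VesselArrivalID")
```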
A cookbook contains a series of recipes. A recipe is a function that generates new DataFrames (dishes) from existing DataFrames (ingredients).
Now, let's generate each container's unloading/loading time and position
at the quay according to the vessel information. The results will be saved
in plan/box_pos_time.msg
. First, add the file's information in the
configuration file.
DATA:
  ...
  plan:
    ...
    box_pos_time.msg:
      TYPE: msgpack
      FIELDS:
        - BoxID: str
        - UnloadingPosition: int
        - UnloadingTime: datetime64[s]
        - LoadingPosition: int
        - LoadingTime: datetime64[s]
Then, create a file plan.cb in the conf directory and add the following code to it. The filename can be arbitrary as long as the extension is .cb.
import numpy as np
import pandas as pd
from pyd2m.cookbook import recipe

@recipe("plan/box_pos_time.msg")
def gen_box_pos_time(cb):
    vel_info = cb.DS["VesselArrivalID", "MooringPosition", "Length", "MooringTime", "HandlingTime"]
    # Unloading: a random position along the first vessel and a random moment within its handling time
    df_u = cb.DS["BoxID", "UnloadingVesselArrivalID"].merge(
        vel_info, left_on="UnloadingVesselArrivalID", right_on="VesselArrivalID")
    df_u["UnloadingPosition"] = df_u.Length * np.random.random(size=len(df_u)) + df_u.MooringPosition
    df_u["UnloadingTime"] = df_u.HandlingTime * np.random.random(size=len(df_u)) + df_u.MooringTime
    # Loading: same idea, for the second vessel
    df_l = cb.DS["BoxID", "LoadingVesselArrivalID"].merge(
        vel_info, left_on="LoadingVesselArrivalID", right_on="VesselArrivalID")
    df_l["LoadingPosition"] = df_l.Length * np.random.random(size=len(df_l)) + df_l.MooringPosition
    df_l["LoadingTime"] = df_l.HandlingTime * np.random.random(size=len(df_l)) + df_l.MooringTime
    return df_u.merge(df_l, on="BoxID")
- Note 1: We can also specify the recipe's ingredients and dishes manually as follows.
@recipe(ingredients=["raw/vessel_info.csv", "plan/berthing.msg"], dishes=["plan/box_pos_time.msg"])
def gen_box_pos_time(cb, vel, bth):
    vel_info = vel.merge(bth, on="VesselArrivalID")
    ...
In this case, the parameters after cb are the DataFrames in the ingredient list, in order. A recipe can also generate more than one dish; if there are multiple dishes, the decorated function should return a tuple containing all the dishes in order.
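For instance, a hypothetical two-dish recipe might split box_info into separate unloading and loading files. The file paths and the split below are invented purely for illustration; the splitting logic itself is plain pandas:

```python
import pandas as pd

# In a real .cb file this function would carry a decorator such as:
# @recipe(ingredients=["raw/box_info.csv"], dishes=["plan/unloading.msg", "plan/loading.msg"])
def split_box_moves(cb, boxes):
    unloading = boxes[["BoxID", "UnloadingVesselArrivalID"]]
    loading = boxes[["BoxID", "LoadingVesselArrivalID"]]
    return unloading, loading  # one DataFrame per dish, in declaration order

boxes = pd.DataFrame({
    "BoxID": ["0", "84"],
    "UnloadingVesselArrivalID": ["0101V0", "0101V2"],
    "LoadingVesselArrivalID": ["0106V3", "0128V4"],
})
dish_u, dish_l = split_box_moves(None, boxes)
```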
- Note 2: When dumping data to a file, PyD2M discards the fields not defined in the configuration file, so you do not need to call the DataFrame's drop method manually.
Now, let's try fetching boxes' unloading information directly.
>>> ds["BoxID", "UnloadingVesselArrivalID", "UnloadingPosition", "UnloadingTime"]
Generating plan/box_pos_time.msg
[] => ['plan/box_pos_time.msg'] By <CookBook.gen_box_pos_time>
Base: plan/berthing.msg
Joining: raw/vessel_info.csv
Base: raw/box_info.csv
Base: raw/box_info.csv
Base: plan/box_pos_time.msg
Joining: raw/box_info.csv
BoxID UnloadingVesselArrivalID UnloadingPosition UnloadingTime
0 0 0101V0 2510 2019-01-01 13:20:21
1 105 0101V0 2737 2019-01-01 13:54:03
2 142 0101V0 2544 2019-01-01 08:08:59
3 224 0101V0 2388 2019-01-01 14:47:23
4 283 0101V0 2727 2019-01-01 11:25:09
5 324 0101V0 2707 2019-01-01 07:26:47
The box_pos_time.msg file has been generated automatically, and the fields have been extracted and joined correctly! The structure of the dataset directory is now as follows.
dataset
+-- conf
| +-- d2m.rc
| +-- vessels.hk
| +-- plan.cb
+-- raw
| +-- vessel_info.csv
| +-- box_info.csv
+-- plan
| +-- berthing.msg
| +-- box_pos_time.msg
We can also add a recipe for the file berthing.msg to the cookbook, by adding the following function to plan.cb (or another .cb file; there can be as many .cb files as you want in the conf directory).
@recipe("plan/berthing.msg")
def gen_berthing_plan(cb):
    df = cb.DS["VesselArrivalID", "Length", "ArrivalTime"]
    df["MooringPosition"] = df.apply(lambda v: np.random.randint(0, cb.DS.QUAY_LENGTH - v.Length), axis=1)
    df["MooringTime"] = df.ArrivalTime + pd.to_timedelta(np.random.random(size=len(df)) * cb.DS.MAX_WAITING_TIME, unit="s")
    df["HandlingTime"] = pd.to_timedelta(np.random.uniform(4, 12, size=len(df)), unit="h")
    return df
Now, let's delete the whole plan
directory and then access boxes' information again.
>>> ds["BoxID", "LoadingVesselArrivalID", "LoadingPosition", "LoadingTime"]
Generating plan/box_pos_time.msg
[] => ['plan/box_pos_time.msg'] By <CookBook.gen_box_pos_time>
Generating plan/berthing.msg
['raw/vessel_info.csv'] => ['plan/berthing.msg'] By <CookBook.gen_berthing_plan>
Base: plan/berthing.msg
Joining: raw/vessel_info.csv
Base: raw/box_info.csv
Base: raw/box_info.csv
Base: plan/box_pos_time.msg
Joining: raw/box_info.csv
BoxID LoadingVesselArrivalID LoadingPosition LoadingTime
0 0 0106V3 625 2019-01-06 11:02:05
1 84 0128V4 2648 2019-01-28 08:12:10
2 105 0110V1 2002 2019-01-11 00:01:40
3 129 0119V1 1144 2019-01-19 23:47:53
4 132 0116V6 2626 2019-01-16 19:53:44
5 142 0101V2 2406 2019-01-02 07:28:23
...
Whoosh! The data are generated again!
We can also use variables in configuration entries. Say, we may want to generate different instances of berthing.msg and box_pos_time.msg for different experiments. What we can do is modify the configuration file and the recipes accordingly, as follows.
DATA:
  ...
  'plan_{exp}':
    berthing.msg:
      ...
    box_pos_time.msg:
      ...
@recipe("plan_{exp}/berthing.msg")
def gen_berthing_plan(cb):
    ...

@recipe("plan_{exp}/box_pos_time.msg")
def gen_box_pos_time(cb):
    ...
We can then specify the {exp} value when creating the DataSource object or when calling the load/dump methods.
>>> ds = DataSource("./dataset", exp="exp_1")
>>> ds["BoxID", "LoadingPosition"]
Generating plan_exp_1/box_pos_time.msg
[] => ['plan_{exp}/box_pos_time.msg'] By <CookBook.gen_box_pos_time>
Generating plan_exp_1/berthing.msg
[] => ['plan_{exp}/berthing.msg'] By <CookBook.gen_berthing_plan>
Base: raw/vessel_info.csv
Base: plan_{exp}/berthing.msg
Joining: raw/vessel_info.csv
Base: raw/box_info.csv
Base: raw/box_info.csv
Base: plan_{exp}/box_pos_time.msg
BoxID LoadingPosition
0 0 1437
1 84 2007
2 105 733
3 129 2060
4 132 1476
5 142 2145
...
>>> ds = DataSource("./dataset", exp="exp_2")
>>> ds["BoxID", "LoadingPosition"]
Generating plan_exp_2/box_pos_time.msg
[] => ['plan_{exp}/box_pos_time.msg'] By <CookBook.gen_box_pos_time>
Generating plan_exp_2/berthing.msg
[] => ['plan_{exp}/berthing.msg'] By <CookBook.gen_berthing_plan>
Base: raw/vessel_info.csv
Base: plan_{exp}/berthing.msg
Joining: raw/vessel_info.csv
Base: raw/box_info.csv
Base: raw/box_info.csv
Base: plan_{exp}/box_pos_time.msg
BoxID LoadingPosition
0 0 659
1 84 1771
2 105 2261
3 129 2136
4 132 2208
5 142 2140
...
>>> ds.load("plan_{exp}/box_pos_time.msg", exp="exp_1")
BoxID UnloadingPosition UnloadingTime LoadingPosition LoadingTime
0 0 868 2019-01-01 22:53:41 1437 2019-01-06 13:52:45
1 84 909 2019-01-01 21:50:22 2007 2019-01-28 07:29:07
2 105 968 2019-01-01 19:56:14 733 2019-01-10 19:35:00
3 129 909 2019-01-01 23:29:08 2060 2019-01-19 17:08:25
4 132 696 2019-01-01 23:10:09 1476 2019-01-16 14:07:04
5 142 702 2019-01-01 19:33:32 2145 2019-01-02 01:58:43
...
After executing this code, the structure of the dataset directory becomes
dataset
+-- conf
| +-- d2m.rc
| +-- vessels.hk
| +-- plan.cb
+-- raw
| +-- vessel_info.csv
| +-- box_info.csv
+-- plan_exp_1
| +-- berthing.msg
| +-- box_pos_time.msg
+-- plan_exp_2
| +-- berthing.msg
| +-- box_pos_time.msg
Sometimes, PyD2M can generate recipes for us automatically. For example, suppose we want to save the boxes' unloading information separately in a file named box_unloading_info.csv, defined as follows.
DATA:
  ...
  tmp:
    "{exp}":
      box_unloading_info.csv:
        FIELDS:
          - BoxID: str
          - UnloadingVesselID: str
          - UnloadingPosition: int
          - UnloadingTime: datetime64[s]
Instead of writing a function like
@recipe("tmp/{exp}/box_unloading_info.csv")
def gen_box_unloading_info(cb):
    return cb.DS["BoxID", "UnloadingVesselID", "UnloadingPosition", "UnloadingTime"]
in a .cb
file, we can do
@recipe("tmp/{exp}/box_unloading_info.csv")
def gen_box_unloading_info(cb):
    return cb.DS.autogen('tmp/{exp}/box_unloading_info.csv')
or simply
from pyd2m.cookbook import auto_recipe
auto_recipe("tmp/{exp}/box_unloading_info.csv")
On the other hand, Steps are pre-defined operations that can simplify recipes. Say, we want to save each vessel's berthing info in a separate file, declared as follows.
DATA:
  ...
  'plan_{exp}':
    '{VesselID}':
      berthing.csv:
        FIELDS:
          - ArrivalTime: datetime64[s]
          - MooringPosition: int
          - MooringTime: datetime64[s]
We can then write a recipe like
from pyd2m.cookbook import recipe
from pyd2m.cookbook.steps import groupby
import pandas as pd
@recipe("plan_{exp}/{VesselID}/berthing.csv")
def gen_vessel_berthing_info(cb):
    vels = cb.DS["VesselID", "ArrivalTime", "VesselArrivalID", "MooringPosition", "MooringTime"]
    return groupby(vels, "VesselID", drop_index=True)
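Assuming the groupby step splits one DataFrame into one dish per group value (one berthing.csv per vessel), its behaviour is roughly this plain-pandas sketch:

```python
import pandas as pd

vels = pd.DataFrame({
    "VesselID": ["7", "2", "7"],
    "MooringPosition": [243, 630, 816],
})
# One frame per VesselID, mirroring one berthing.csv file per vessel;
# the grouping key itself is dropped, as it is encoded in the file path
per_vessel = {vid: g.drop(columns="VesselID").reset_index(drop=True)
              for vid, g in vels.groupby("VesselID")}
```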
Finally, quick recipes can also generate recipes automatically by following some pre-defined strategies. Say, if we want to concatenate the contents of all plan_{exp}/{VesselID}/berthing.csv files into another file plan_{exp}/yet_another_berthing_info_file.csv with the same fields as those berthing.csv files, the recipe can be generated automatically by
from pyd2m.cookbook import quick_recipe
quick_recipe("concat",
             ingredients=["plan_{exp}/{VesselID}/berthing.csv"],
             dishes=["plan_{exp}/yet_another_berthing_info_file.csv"],
             axis=0)
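Presumably, the "concat" strategy amounts to a pandas concatenation along axis 0, i.e. stacking the per-vessel frames back into a single one:

```python
import pandas as pd

# Two per-vessel berthing frames (made-up values)
v7 = pd.DataFrame({"MooringPosition": [243]})
v2 = pd.DataFrame({"MooringPosition": [630]})
# What the generated "concat" recipe would produce
combined = pd.concat([v7, v2], axis=0, ignore_index=True)
```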