Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[KED-1740] Name-spaced Parameters #399

Closed
benhorsburgh opened this issue Jun 2, 2020 · 3 comments
Closed

[KED-1740] Name-spaced Parameters #399

benhorsburgh opened this issue Jun 2, 2020 · 3 comments
Labels
Issue: Feature Request New feature or improvement to existing feature

Comments

@benhorsburgh
Copy link

Description

First of all - I love Kedro 0.16.x namespacing!

When I create a new modular pipeline, I can very easily re-use this across different name spaces, which is intuitive to use thanks to the new conf sub-directories, and pipeline structure. However, the name-spacing of parameters can only be achieved using messy work arounds.

Let me explain using an example.

Let's say I create kedro pipeline create clean_timestamps as a modular pipeline. In this, I create the following pipeline:

def create_pipeline(**kwargs):
    return Pipeline([
        node(
            func=round_timestamps,
            inputs=dict(params="params:clean_timestamps_params", data="data"),
            outputs=dict(rounded="rounded"),
            name="round_timestamps"
        )
    ])

In my conf/base/pipelines/clean_timestamps/catalog.yml file I can make the datasets very obvious, along with name-spaced versions of my datasets.

# no namespace
#data:
#  type: MemoryDataSet
#
#rounded:
#  type: MemoryDataSet

# hour namespace
hour.data:
  type: pandas.CSVDataSet
  filepath: data/01_raw/hour.csv

# minute namespace
minute.data:
  type: pandas.CSVDataSet
  filepath: data/01_raw/minute.csv

minute.rounded:
  type: pandas.CSVDataSet
  filepath: data/05_model_input/minute_rounded.csv

This gives me very clean and clear flexibility to keep the data that I need, and set up new namespaces. Using this, I can create the default and namespaced pipelines very easily.

clean_timestamps_pipeline = clean_timestamps.create_pipeline()

hour_clean_timestamps_pipeline = pipeline(
    clean_timestamps_pipeline,
    namespace="hour"
)

minute_clean_timestamps_pipeline = pipeline(
    clean_timestamps_pipeline,
    namespace="minute"
)

and everything magically works because of how my config has been set up.


Now if we try to do the same with parameters! Intuitively, I should set up my conf/base/pipelines/clean_timestamps/parameters.yml in exactly the same way, which might look like:

# no namespace
clean_timestamps_params:
  round_frequency: "1min"
  round_col: "dteday"

# hour namespace
hour.clean_timestamps_params:
  round_frequency: "1h"
  round_col: "dteday"

# minute namespace
minute.clean_timestamps_params:
    round_frequency: "1min"
    round_col: "dteday"

and in theory everything should work! Kedro of course does not support this yet. The current workaround is that I make use of the parameters argument on pipelines.

hour_clean_timestamps_pipeline = pipeline(
    clean_timestamps_pipeline,
    parameters={"params:clean_timestamps_params": "params:hour.clean_timestamps_params"},
    namespace="hour"
)

Context

This is important because failure to namespace parameters means that:

  1. I need to add the namespace in two separate places
  2. Non-standard namespacing strategies will be used for parameters leading to:
    2.1. Increased learning curves for new team members on projects
    2.2 Increased bugs due to not understanding a team convention
  3. It is not possible to specify parameters using dict(param="my_param") format. This is extremely useful, because it means the only strings in my code are things that are configurable, not placeholder variable names. It is much easier to read, and more maintainable.

Adding namespaced parameters will:

  1. Encourage best practices
  2. Make learning new projects much easier
  3. Decrease tech debt due to developers inventing their own strategies
  4. Make code much more readable
  5. Decrease bugs introduced by relying on complex string mappings

Possible Implementation

Continuing from the above example, I would suggest namespacing is handled in exactly the same way as datasets, but appending a prefix to the expected parameter.

With no namespace: "params:my_params"
With namespace: "params:namespace.my_params"

Possible Alternatives

An alternative may be to suffix, but this would be confusing given the dataset implementation.

@benhorsburgh benhorsburgh added the Issue: Feature Request New feature or improvement to existing feature label Jun 2, 2020
@mzjp2 mzjp2 changed the title Name-spaced Parameters [KED-1740] Name-spaced Parameters Jun 2, 2020
@mzjp2
Copy link
Contributor

mzjp2 commented Jun 2, 2020

I've logged this in our backlog to take a look at. Thank you! @lorenabalan you might be interested in this/have a reply :)

@lorenabalan
Copy link
Contributor

Thank you @benhorsburgh for such a thoughtful writeup!! Your use case makes sense. Here are some of my thoughts, I'd love to hear more on this.

We've had conflicting feedback about this at the time, with some users saying that it's more often that modular pipelines will only differ by one or 2 parameters, and everything else is intact. I think it was also confusing because "." has a special meaning for parameters. Currently users can access nested parameters through nested.a.b.c and pass that to the node. So when there's a reference nested.a, it's ambiguous if the user meant "entry a under nested" or "parameter named nested.a" and they might end up overwriting each other.

I hear your point about the strings! Currently that'd be hard to do even if we got rid of the "params:" prefix, because we use a jmespath-like syntax for accessing nested parameters with "." notation, which dict() would still be unhappy about, so this one needs more thinking.

pull bot pushed a commit to FoundryAI/kedro that referenced this issue Jul 17, 2020
@limdauto
Copy link
Contributor

This has now landed in develop, so it will go out in 0.18.0.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Issue: Feature Request New feature or improvement to existing feature
Projects
None yet
Development

No branches or pull requests

4 participants