
Include fails silently and new feature flag functionality #367

Merged
merged 1 commit on Jun 4, 2020

Conversation

TheRealFalcon
Member

WIP PR is for discussion, not review.

If an #include line fails, we only get a warning in the stdout log, but no failure in the result.json. For more background, see https://bugs.launchpad.net/cloud-init/+bug/1734939

I replaced the possible log warnings with exceptions (including a similar warning for a failed decompression I saw in the same file). I now get this in the result.json (the URL intentionally has an extra character at the end):

 "v1": {
  "datasource": "DataSourceNoCloud [seed=ds_config][dsmode=net]",
  "errors": [
   "404 Client Error: Not Found for url: https://github.com/raw/canonical/cloud-init/master/doc/examples/cloud-config-run-cmds.txtd"
  ]
 }
}
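
For context, the raising version of user_data.py's _do_include looks roughly like this (a sketch only, based on the traceback further down; include_url is an illustrative name, while read_file_or_url and UrlError come from cloudinit.url_helper):

from cloudinit.url_helper import UrlError, read_file_or_url

# Inside UserDataProcessor._do_include; previously a failed fetch was
# only logged as a warning and processing continued.
try:
    resp = read_file_or_url(include_url, ssl_details=self.ssl_details)
except UrlError as urle:
    raise Exception(str(urle)) from urle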

Is this too blunt an instrument? Previously, if one piece failed, we'd log the warning and continue. Raising an exception stops any additional processing. Do we want that? I was initially thinking no, but if you have a later piece that relies on an earlier piece having run, you probably don't want the later piece being run, do you?

Is there currently any other way of signaling errors without raising exceptions?

If we do want to use exceptions, is a plain Exception OK here, or would custom exceptions be useful?

@OddBloke OddBloke self-assigned this May 14, 2020
@OddBloke
Collaborator

Is there currently any other way of signaling errors without raising exceptions?

I'm not familiar with the error signalling, so I'm going to investigate it myself and document my investigation here.

status.json and result.json are populated by status_wrapper, specifically here (functor is a CLI action function, like main_init):

try:
    ret = functor(name, args)
    if mode in ('init', 'init-local'):
        (datasource, errors) = ret
        if datasource is not None:
            v1['datasource'] = str(datasource)
    else:
        errors = ret
    v1[mode]['errors'] = [str(e) for e in errors]
except Exception as e:
    util.logexc(LOG, "failed stage %s", mode)
    print_exc("failed run of stage %s" % mode)
    v1[mode]['errors'] = [str(e)]

So we can see that exceptions aren't the only way that errors can be populated, the CLI action function can also return them.
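
As a minimal illustration (hypothetical names, not code from the tree), an action function can report errors without raising, simply by returning them:

def main_example(name, args):
    # status_wrapper will store str(e) for each returned error in result.json
    errors = []
    try:
        datasource = fetch_datasource(args)  # hypothetical helper
    except Exception as e:
        errors.append(e)
        datasource = None
    return (datasource, errors)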

status_wrapper is only used for the init and modules commands:

if name in ("modules", "init"):
    functor = status_wrapper

so that means the only two CLI action functions we need to examine are main_init and main_modules.

I believe that main_init will be the relevant function for #include, but I've asked you in IRC for a traceback to confirm this. Assuming it is, it returns at multiple different points, so we'll need to examine whether there's already a return where we want it that we can use.

From the traceback you provided in IRC, we can see that main_init is the relevant function here, and more specifically, this line:

# update fully realizes user-data (pulling in #include if necessary)
init.update()

@TheRealFalcon
Member Author

Traceback:

2020-05-14 20:02:57,205 - user_data.py[WARNING]: 404 Client Error: Not Found for url: https://github.com/raw/canonical/cloud-init/master/doc/examples/cloud-config-run-cmds.txtd
2020-05-14 20:02:57,205 - util.py[WARNING]: failed stage init
2020-05-14 20:02:57,205 - util.py[DEBUG]: failed stage init
Traceback (most recent call last):
  File "/home/james/cloud-init/cloudinit/user_data.py", line 227, in _do_include
    ssl_details=self.ssl_details)
  File "/home/james/cloud-init/cloudinit/url_helper.py", line 101, in read_file_or_url
    return readurl(url, **kwargs)
  File "/home/james/cloud-init/cloudinit/url_helper.py", line 340, in readurl
    raise excps[-1]
cloudinit.url_helper.UrlError: 404 Client Error: Not Found for url: https://github.com/raw/canonical/cloud-init/master/doc/examples/cloud-config-run-cmds.txtd

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/james/cloud-init/cloudinit/cmd/main.py", line 653, in status_wrapper
    ret = functor(name, args)
  File "/home/james/cloud-init/cloudinit/cmd/main.py", line 377, in main_init
    init.update()
  File "/home/james/cloud-init/cloudinit/stages.py", line 363, in update
    self._store_userdata()
  File "/home/james/cloud-init/cloudinit/stages.py", line 390, in _store_userdata
    processed_ud = self.datasource.get_userdata()
  File "/home/james/cloud-init/cloudinit/sources/__init__.py", line 385, in get_userdata
    self.userdata = self.ud_proc.process(self.get_userdata_raw())
  File "/home/james/cloud-init/cloudinit/user_data.py", line 83, in process
    self._process_msg(convert_string(blob), accumulating_msg)
  File "/home/james/cloud-init/cloudinit/user_data.py", line 147, in _process_msg
    self._do_include(payload, append_msg)
  File "/home/james/cloud-init/cloudinit/user_data.py", line 245, in _do_include
    raise Exception(message) from urle
Exception: 404 Client Error: Not Found for url: https://github.com/raw/canonical/cloud-init/master/doc/examples/cloud-config-run-cmds.txtd

@OddBloke
Collaborator

Previously, if one piece failed, we'd log the warning and continue.

To some extent, the rest of this question doesn't really matter. People may be relying on #include processing not causing cloud-init to terminate, so we should retain that behaviour.

Raising an exception stops any additional processing. Do we want that? I was initially thinking no, but if you have a later piece that relies on an earlier piece having run, you probably don't want the later piece being run, do you?

That said, I think there are a couple of stronger arguments to make for the current behaviour:

  • if fetching your #include user-data fails, you probably still want cloud-init to fetch your SSH keys from your cloud and apply them to the instance, so that you can log into the system and debug the failure
  • if you've specified multipart cloud-config, then the rest of the user-data can still be applied to the system, even if fetching one part of it fails

Does that sound reasonable?

@TheRealFalcon
Member Author

@OddBloke
Yep. I haven't really interacted with any multi-part configs yet, so I'm still not really aware of how they're used or if they were "stacked" at all. But you're right that

People may be relying on #include processing not causing cloud-init to terminate, so we should retain that behaviour.

is the most important piece.

@smoser
Collaborator

smoser commented May 14, 2020

My $0.02.

Is this too blunt an instrument? Previously, if one piece failed, we'd log the warning and continue. Raising an exception stops any additional processing. Do we want that? I was initially thinking no, but if you have a later piece that relies on an earlier piece having run, you probably don't want the later piece being run, do you?

My personal opinion has changed over time on this point. I started off as "ignore errors, log WARN, keep going". Now I'm very much on the other end. No one reads warnings.

I'd argue for logging the exception wherever you can and exiting failure. Anything else will just get ignored, and will make the original problem harder to find.

I think that @OddBloke is suggesting that we can't make such a change as it would be backwards-incompatible. I think that is arguable. But we definitely can make the change going forward, and just have to maintain the old behavior in Ubuntu releases. We have done that for other things. It's not that painful.

I honestly think failing is the right behavior. And you can do the right thing.

@OddBloke
Collaborator

OK, if Scott would be comfortable making this change going forward, then that moves the needle for me significantly. I would also be comfortable making this an error condition.

One wrinkle to make sure we're agreed on: the expectation is that the user will be able to debug this failure by reading the console logs from their instance provided by their cloud, because the hard error out likely means they don't have SSH access (I think?). So we need to ensure that this error is emitted to stdout. (Maybe all unhandled exceptions are, because that's what Python does by default, but we should double-check.)

We'll also want a feature flag to easily toggle this behaviour off in existing Ubuntu releases (by which I mean something like error_on_include_failure).

Does that sound right?

@OddBloke
Collaborator

we could introduce #require: an #include that stops the entire execution chain if it fails

I like the idea, but my suspicion is that #include isn't used a huge amount (and in most cases I would expect is very unlikely to fail except via a misconfiguration), so we shouldn't invest too much extra effort into it up front. My feeling is that we should make it do the right thing, as smoser put it, and if that isn't the right thing for all of our users then (they can file a bug report and) we can address it at that point (with #require or #optional-include or some other solution that presents itself once we have a concrete problem to solve).

@TheRealFalcon
Member Author

@OddBloke , I remember you mentioning recently that you wanted to implement feature flags, so I'm assuming we don't have any yet. I pushed another commit with a possible implementation using a feature flag. Again, still WIP, but curious what you think of it. I'm still not fully up to speed on how we do our versioning with packages and how new updates get versioned into old OSes, so maybe the way I'm checking versions wouldn't work. If this way would work, one big caveat is that once we release, we'd immediately need to push the version in the repo to something higher (i.e., my '20.2-post1' here) as a temporary "higher than released, but not actually our next version yet" marker, so that new features can be triggered correctly as being greater than the released version.

@OddBloke
Collaborator

OddBloke commented May 15, 2020

I remember you mentioning recently that you wanted to implement feature flags, so I'm assuming we don't have any yet.

So in the past we have used configuration to toggle changes in behaviour, see 1d5e9ae for an example of introducing such a configuration option, and azure-apply-network-config-false.patch in ubuntu/xenial for the way in which we revert that behaviour on xenial. (That is a quilt patch, it's applied during package build.)

My feeling, though, is that not all behavioural changes in cloud-init should be configurable. Even introducing a boolean configuration option effectively doubles our potential support burden: to continue the above example, we could now get bug reports from xenial Azure instances which are in the default configuration (i.e. the previous behaviour) or in the non-default configuration, where the network configuration of instances can be substantially different. (We can also get the opposite on more recent releases: e.g. focal with secondary NICs disabled.) Now, to be clear, I'm not picking on this Azure change; it seems to me that it should be configurable.

So what's different with this #include change? A couple of things, I think. Firstly, and most importantly, you can easily work around this change if it's causing you problems: stop specifying a broken #include URL in your user-data. Contrast this with the Azure networking case: without the configuration option, there's no easy way to configure your secondary NICs on xenial. Secondly, this change is to cloud-init's internals, rather than explicitly changing the behaviour of instances on which cloud-init is installed. This is a more subtle distinction, so let me try restating it: it is reasonable for us to unilaterally change how we expect users to provide configuration to us (though, of course, we should be thoughtful when so-doing), but it's less reasonable for us to change the meaning of configurations (including the null configuration, i.e. our default behaviour) without a way for users to opt-out. Apologies if this still isn't clear; my first point stands regardless. :-)

I'm still not fully up to speed on how we do our versioning with packages and how new updates get versioned into old OSes, so maybe the way I'm checking versions wouldn't work.

Yep, you are correct that it won't work. We backport upstream snapshots of cloud-init to older Ubuntu releases, so xenial will be on 20.2 once this change ends up there. I think it's reasonable to hard-code this to True here, and we'll introduce a quilt patch for all current Ubuntu releases which will do s/True/False/. What do you think?

@TheRealFalcon
Member Author

Apologies if this still isn't clear; my first point stands regardless. :-)

I think I get it. We don't want to break user configuration normally, so we should be wary of making changes like this when they affect user configuration. But #include is more of an internal change, and one more easily worked around, so we can make it here. Let me know if I'm missing the point 😄

Yep, you are correct that it won't work. We backport upstream snapshots of cloud-init to older Ubuntu releases, so xenial will be on 20.2 once this change ends up there. I think it's reasonable to hard-code this to True here, and we'll introduce a quilt patch for all current Ubuntu releases which will do s/True/False/. What do you think?

I don't know, tbh. We already have infrastructure in place for applying these quilt patches that seems to work, but as an outside person coming in, it feels fairly complex. There might be aspects of versioning that justify that complexity that I'm not aware of yet, but if I were approaching the problem from scratch, I would put the feature-change logic into code rather than patches. I.e., use something from the environment, or one or two code fields in a separate file that change per branch or release and uniquely identify the environment we're in, and then make all of the changes based on those fields, so the only thing that differs between releases/branches is that unique identifier. Does that make sense?

Given that we already have this process for applying merges, I suppose it makes the most sense to just stick with that. But in that case, does it make sense to add a separate features.py file? If we're going to have a quilt patch, should I remove this features.py and move ERROR_ON_USER_DATA_FAILURE into user_data.py?

@OddBloke
Collaborator

OddBloke commented May 15, 2020 via email

@TheRealFalcon
Member Author

Some comments on a few of these...

  • This would presumably push some Ubuntu-specific logic up into the
    upstream codebase and out of our Ubuntu packaging.

I'm assuming that's a good thing?

  • Relatedly, how would other downstreams (e.g. Debian, SUSE) handle
    manipulating the feature flags if they want to retain previous
    behaviour for an existing release? (It's certainly not practical to
    capture every downstream's feature flag logic into the upstream
    codebase.)

Curious about the background on this one. Do other downstreams apply their own patches when they package that we have no knowledge of? If we know about them, aren't we already capturing their feature flags, just in a possibly less efficient manner? If we don't know about them, why is it our concern?

  • How do we test that our feature flag logic is correct?

Shouldn't be too hard to mock patch the flags...either directly or indirectly by instead patching the release version. That also makes it easy to test expected behavior on an old release version without having to apply patches first. Do we currently have a way of testing that the patched behavior is correct?
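
Something like this, for example (a sketch, assuming the flag lives in cloudinit.features as in the pushed commit):

from unittest import mock

import cloudinit.features

def test_flag_can_be_patched():
    # Patch the module attribute; code that reads the flag at call time
    # will see the patched value.
    with mock.patch.object(
            cloudinit.features, 'ERROR_ON_USER_DATA_FAILURE', False):
        assert cloudinit.features.ERROR_ON_USER_DATA_FAILURE is False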

Regardless of the resolution to the above conversation, I still think a
separate features.py makes sense: it gives us a single place to
examine to find behavioural differences between releases, rather than
having to hunt through quilt patches. The ideal ending point for this
stream of work, I think, is our per-release quilt patches consisting
only of (a) modifications to default configuration (for things we would
like users to be able to modify), and (b) features.py for everything
else. (In your proposal above, I'm assuming the per-release code fields
are at the top of features.py; if not, then
s/features.py/that_file_instead.py/ in this ideal vision.)

+1

@OddBloke
Collaborator

OddBloke commented May 15, 2020 via email

@TheRealFalcon
Member Author

TheRealFalcon commented May 18, 2020

Thanks. I wasn't aware of the downstream impact or the fact that we want to keep our release info out of the core logic.

One more possibility that I have been considering is having an initial features file with all of the default values hardcoded (like you had initially suggested), but then also having a specific extension point where branches/releases/downstreams can provide specific overrides. It could be as crude as something like this:

ERROR_ON_USER_DATA_FAILURE = True
# Other flags...

try:
    from feature_overrides import *
except ImportError:
    pass

By default the file wouldn't even exist on master. But branches/downstreams could add the file to override defaults. To me, something like this could offer a number of benefits:

  1. It provides a way to easily toggle the feature flags in a way that doesn't rely on patching
  2. Since the file wouldn't exist on master, we could commit it to branches with no risk of merge conflicts when merging from master
  3. It could limit our testing burden. We could promise to test only the upstream defaults (and any other configuration we care about). If somebody downstream wants to change the configuration, it's still their burden to test it.
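
Concretely, a downstream branch wanting to keep the old warn-and-continue behaviour would commit just this (sketch):

# cloudinit/feature_overrides.py -- exists only on the downstream branch,
# never on master; overrides the default in cloudinit/features.py
ERROR_ON_USER_DATA_FAILURE = False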

@OddBloke
Collaborator

OddBloke commented May 18, 2020 via email

@blackboxsw
Collaborator

Thanks. I wasn't aware of the downstream impact or the fact that we want to keep our release info out of the core logic.

One more possibility that I have been considering is having an initial features file with all of the default values hardcoded (like you had initially suggested), but then also having a specific extension point where branches/releases/downstreams can provide specific overrides. It could be as crude as something like this:

ERROR_ON_USER_DATA_FAILURE = True
# Other flags...

try:
    from feature_overrides import *
except ImportError:
    pass

By default the file wouldn't even exist on master. But branches/downstreams could add the file to override defaults. To me, something like this could offer a number of benefits:

I really like the idea of a single features.py module which captures hard-coded True/False defaults for features that we know are toggled differently on different downstreams (given that we know that xenial may not adopt FEATURE_X because it may introduce unexpected behavioral changes).

I also like the optional feature_overrides module you suggested, specifically because of your suggestion about patch collision avoidance.

The gap I feel like we still aren't tracking, though, is that we don't know in master that ubuntu/xenial has a feature overridden. Could we express this in tip so we don't have to provide an override module in the ubuntu/* downstream branches? What do folks think about adding a feature_overrides module something like the following, which downstreams could drop or modify as they see fit?

from cloudinit import util

sys_info = util.system_info()
if sys_info['variant'] == 'ubuntu' and sys_info['dist'][2] in ('xenial', 'bionic'):
    ERROR_ON_USER_DATA_FAILURE = False

1. It provides a way to easily toggle the feature flags in a way that doesn't rely on patching

I like that a lot.

2. Since the file wouldn't exist on master, we could commit it to branches with no risk of merge conflicts when merging from master

I wonder whether we could at least provide an override in master for the known feature disabling we are going to do, rather than separately applying patches to ubuntu/xenial|bionic|focal etc. as part of our SRU process. Those work items are more likely to be forgotten without a process around RELEASE_BLOCKER or SRU_BLOCKER comments in the code.

3. It could limit our testing burden. We could promise to test only the upstream defaults (and any other configuration we care about). If somebody downstream wants to change the configuration, it's still their burden to test it.

Since our ubuntu/* branches (and SRUs to xenial, bionic, etc.) are the downstream, we're still on the hook there.

@blackboxsw
Collaborator

I think this is the most important open question I have: do we only test with the upstream default? .. I would like to be able to limit our test matrix, so let's dig into the details....

While we have a feature flag settable in our code, I'd vote for upstream's 'burden' as:

  • unittest coverage for mocking both True and False feature paths
  • a single manual or automated cloudtest to verify expected downstream feature behavior (on or off) on Xenial, Bionic, Eoan, Focal, and Groovy

The reason I suggest the mock of that feature value is so unittests don't have to get exposed to whether the code happens to introduce a feature_override, since we know this is an expected behavior in downstreams (specifically ours, but broadly other downstreams).

And I think that testing both branches of a feature flag would be more efficient in the upstream codebase, and as these are generic feature flags (rather than being specifically fitted to Ubuntu's needs), I don't think there is anything dissonant about those tests sitting in the upstream codebase.

+1

A side note, another open question: do we need a defined lifetime/deprecation policy for feature flags? From an Ubuntu POV, we won't need ERROR_ON_USER_DATA_FAILURE once we are no longer backporting full upstream releases to focal (which could be as ahem early as 2025). If there are no other downstreams requiring the flag at that point, we could remove it (and the = False codepaths) from cloud-init's codebase.

@OddBloke, I think dropping a feature flag entirely makes sense when all our Ubuntu downstream support of that disabled flag ends. I'm presuming that other downstreams will continue to patch out behavior that they find irrelevant, if we have a feature boundary. Downstreams can revert that change if they want to continue using that feature.

@TheRealFalcon
Member Author

TheRealFalcon commented May 19, 2020

(this comment was supposed to go up right when I pushed my last commit, but for some reason it never went through. @blackboxsw , I typed this up before you posted your most recent comments)

I pushed a commit with (most) of the changes needed for this particular feature.

A summary of what's being proposed (mostly copied from an earlier comment):
Have an initial features file with all of the default values hardcoded, and also have a specific extension point where branches/releases/downstreams can provide specific overrides. It would look something like this:

ERROR_ON_USER_DATA_FAILURE = True
# Other flags...

try:
    from feature_overrides import *
except ImportError:
    pass

By default the file wouldn't even exist on master. But branches/downstreams could add the file to override defaults.

Benefits:

  • It provides a way to easily toggle the feature flags in a way that doesn't rely on patching
  • Since the file wouldn't exist on master, we could commit it to branches with no risk of merge conflicts when merging from master

We would be responsible for all possible flag paths.

Quick comment on terminology after a quick googling...
Some people use feature flags to refer to only flags that can be toggled at runtime. Others include configuration baked into a release. I don't think the way we're using the term is wrong.

A quote from Martin Fowler:

Managing toggle configuration via source control and re-deployments is preferable, if the nature of the feature flag allows it. Managing toggle configuration via source control gives us the same benefits that we get by using source control for things like infrastructure as code. It can allow toggle configuration to live alongside the codebase being toggled, which provides a really big win: toggle configuration will move through your Continuous Delivery pipeline in the exact same way as a code change or an infrastructure change would. This enables the full benefits of CD - repeatable builds which are verified in a consistent way across environments. It also greatly reduces the testing burden of feature flags. There is less need to verify how the release will perform with both a toggle Off and On, since that state is baked into the release and won't be changed (for less dynamic flags at least). Another benefit of toggle configuration living side-by-side in source control is that we can easily see the state of the toggle in previous releases, and easily recreate previous releases if needed.

@TheRealFalcon
Member Author

TheRealFalcon commented May 19, 2020

What gap I feel like we still aren't tracking though is that we don't know in master that ubuntu/xenial has a feature overridden. Could we express this in tip so we don't have to provide an override module in the ubuntu/* downstream branches?

@blackboxsw Can you explain your thinking more here? I'm not sure I understand. Why should master care about what xenial is doing? Shouldn't our own downstreams function the same as others (non-ubuntu)? Why do you think it would be simpler to express this in master rather than committing an override once in a downstream branch?

@blackboxsw
Collaborator

blackboxsw commented May 20, 2020

Why do you think it would be simpler to express this in master rather than committing an override once in a downstream branch?

There are a few meager benefits I can think of (but not a strong argument for):

  • a single place in the upstream repo that represents our known feature configuration for the downstreams that we own
    • helps inform us of downstream use of feature flags and when we can EOL/retire a given feature path
  • no additional PRs required to ubuntu/* branches to unset flags which we know about at the time they're created
  • Due to the fact that our SRU release cycles are fairly far apart, we have ended up applying release-specific behavioral patches significantly after the feature landed in master. In the past, some of those SRU-related default patches have been missed. I was grasping for a mechanism by which we can instrument and land any downstream-specific feature behavior without having to wait for an SRU. I want to avoid a belated followup on SRU release day to instrument a feature toggle.

Again, not a strong opinion, and I can be easily swayed. The more I think about it, we can probably still decouple the feature disabling work from our SRU release process by adding the downstream feature_override values regardless of whether that specific feature flag support is actually read by the ubuntu/* release branch because it's only a variable that happens to be unreferenced until the next new-upstream-snapshot is run.

@mitechie
Contributor

Thanks for this interesting conversation; there are a lot of details in there on how things work today. It sounds like things are heading a little more feature-flag-like, in that there is an "is this on or off" type setup.

I challenge the "support burden" of having the flags in the upstream codebase, as the burden is there regardless. In the case of it being applied downstream, there's the discussion of the test requirements for that. Ideally, cloud-init would be able to test its own expected behavior whether the feature is on or off, and be able to do it in a more "unit-test" style, in that you can get close to the code and validate the flag does what it says it does.

However, if downstream releases were flipping that flag and wanting assurances at a broader functional-test level, that would be the downstream's burden to validate and be sure of. For instance, in the case of Ubuntu series behaving differently, we'd have purer upstream test cases that verify cloud-init behaves, and broader series tests that verify they continue to work in their expected feature configuration.

My main concern with the quilt patches is that if you're chasing a problem in code, you can't validate what code is actually being run/seen until you evaluate not just what's in the upstream tree, but also what's in these patches. Which patches apply? In what order? It makes it much more difficult, in my mind, to reason about what the person with an issue is struggling with. It also means it's near impossible to perform any sort of testing without going through the build process as part of that. That means the iteration speed of the work is significantly impacted. Obviously we're not going to change the world, but I'm extremely hesitant to build upon it.

Personally, I'd rather see flags in an off-to-the-side file that is read/processed, such that a downstream can provide a single "overrides" or "flags" file that flips things. Rather than a list of patches, anything they'd have would be clearly in one specific file, and you can trace the history of that file over time if needed to aid in finding older issues, versus tracking the existence of patch files that came and went across time.

@OddBloke
Collaborator

OddBloke commented May 21, 2020 via email

@mitechie
Contributor

Thanks, I had some confusion that the patches were of the more code variety vs. the flags being on vs. off. It seems that's not the idea at all, and the patch is more about setting what the true/false settings are for flags that are achievable in the primary codebase.

In looking at the history of patches that makes sense as well. The only one that gives me the :/ is the one around ubuntu-advantage-revert-tip.patch but I assume there's a story there.

Thanks for the clarification and a good amount of education on the process and specifics.

@TheRealFalcon
Member Author

@OddBloke Can you take another look at my last two commits?

@blackboxsw blackboxsw left a comment
Collaborator

Thanks for the updated documentation here explaining the intent behind feature flag development. I'm good with your approach of separating feature_overrides.py into specific downstream branches via quilt patches. It's easy enough for us to track them in the release-specific downstreams.

Thanks @OddBloke and @TheRealFalcon for all the discussion here on this.

@@ -0,0 +1,57 @@
"""
This file is for testing the feature flag functionality itself,
NOT for testing that any individual feature flag
Collaborator

s/that any/any/

# we call this fixture, cloudinit.feature_overrides doesn't exist yet, so
# we need to reload cloudinit.features first before we can reload
# cloudinit.feature_overrides, hence the three calls here.
reload(cloudinit.features)
Collaborator

Wow, thanks for this. I've never played with importlib directly. Great contextual comments.

@OddBloke OddBloke left a comment
Collaborator

I've just reviewed the feature flag logic thus far, not the actual use of the feature flag being introduced.

@@ -0,0 +1,23 @@
# This file is part of cloud-init. See LICENSE file for license information.
# pylint: disable=wildcard-import
Collaborator

A nit (because it probably won't ever matter here), but we could move this closer to line 21 (or onto line 21, potentially) to avoid disabling this warning for the entire file.

HACKING.rst Outdated
Feature Flags
-------------

Feature flags are used as a way to easily toggle configuration
Collaborator

Instead of ~replicating the content here in two places, I would suggest using something like:

.. automodule:: cloudinit.features

which will include the docstring from cloudinit.features inline. Then we can just maintain this information in one place (and it's easily accessible via both the documentation and in the appropriate place in the source code).

Member Author

Good idea! Makes the docstring slightly more awkward, but I'm assuming that's ok.
[screenshot of the rendered documentation]

the flag and intended lifetime
"""

# If there is a failure in obtaining user data (i.e., #include or
Collaborator

If we make this a docstring, then we may be able to include it in the generated documentation relatively easily by modifying my earlier proposal to:

.. automodule:: cloudinit.features
   :members:

Member Author

If I place the docstring after the variable, it works, but do we want this kind of granularity in the HACKING.rst? I would think not, but I'll defer to you.
[screenshot of the rendered documentation]

Comment on lines 46 to 53
@pytest.mark.parametrize('create_override',
                         [{'ERROR_ON_USER_DATA_FAILURE': False}],
                         indirect=['create_override'])
def test_feature_with_override(self, create_override):
    from cloudinit.features import ERROR_ON_USER_DATA_FAILURE
    assert ERROR_ON_USER_DATA_FAILURE is False
Collaborator

Indirectly parameterising a fixture that we are requesting in this test definition seems a little strange. I wonder if this is indicating that create_override might be better as a context manager? This would make this test definition something like:

    def test_feature_with_override(self):
        with create_override({'ERROR_ON_USER_DATA_FAILURE': False}):
            from cloudinit.features import ERROR_ON_USER_DATA_FAILURE
            assert ERROR_ON_USER_DATA_FAILURE is False

(I think the only change required in create_override would be to wrap the yield in a try/finally, and to decorate it with contextlib.contextmanager.)
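
For illustration, a sketch of that context-manager version, assuming create_override writes a temporary cloudinit/feature_overrides.py (its actual body isn't shown in this thread):

import os
import sys
from contextlib import contextmanager
from importlib import reload

import cloudinit.features

@contextmanager
def create_override(overrides):
    # Write a temporary overrides module next to features.py
    override_file = os.path.join(
        os.path.dirname(cloudinit.features.__file__), 'feature_overrides.py')
    with open(override_file, 'w') as f:
        for name, value in overrides.items():
            f.write('{} = {}\n'.format(name, repr(value)))
    try:
        sys.modules.pop('cloudinit.feature_overrides', None)
        reload(cloudinit.features)
        yield
    finally:
        # Guarantee cleanup even if the test body raises
        os.remove(override_file)
        sys.modules.pop('cloudinit.feature_overrides', None)
        reload(cloudinit.features)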

What do you think?

Member Author

I think of fixtures as a way of achieving a common setup/teardown for tests. I agree that the syntax is a little wonky, but the indirect parametrization is really just a way of passing arguments to a fixture. To me that seems more idiomatic (in a pytest sense) than rolling my own with context managers. That said, it's not a strongly held view, so if you'd really prefer I change it, let me know.

Collaborator

Yeah, that's a great point about using fixtures as a common setup/teardown mechanism. Let's stick with it.

One thing I think we can do to reduce the repetition slightly is pass indirect=True instead of repeating the fixture name again (I just tested locally and the tests still passed).
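
i.e. something like this (sketch):

@pytest.mark.parametrize('create_override',
                         [{'ERROR_ON_USER_DATA_FAILURE': False}],
                         indirect=True)
def test_feature_with_override(self, create_override):
    from cloudinit.features import ERROR_ON_USER_DATA_FAILURE
    assert ERROR_ON_USER_DATA_FAILURE is False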

Comment on lines 32 to 34
reload(cloudinit.features)
reload(cloudinit.feature_overrides)
reload(cloudinit.features)
Collaborator

I think we can actually simplify this a little by removing the internally-cached feature_overrides module if present, then reloading cloudinit.features (which will then have to reimport our updated feature_overrides):

Suggested change
reload(cloudinit.features)
reload(cloudinit.feature_overrides)
reload(cloudinit.features)
sys.modules.pop('cloudinit.feature_overrides', None)
reload(cloudinit.features)

and we could potentially even do the same for cloudinit.features (but that could cause problems with other tests so I wouldn't recommend it).

@TheRealFalcon
Member Author

@OddBloke

I've just reviewed the feature flag logic thus far, not the actual use of the feature flag being introduced.

Would it make more sense for me to move the functionality around the flag being introduced into a separate PR and rename this PR?

@OddBloke OddBloke left a comment
Collaborator

Once again, just focused on the feature flag functionality; will complete my review of the rest of the code later today.

"""
If there is a failure in obtaining user data (i.e., #include or
decompress fails), old behavior is to log a warning and proceed.
After the 20.2 release, and we instead raise an exception.
Collaborator

Suggested change
After the 20.2 release, and we instead raise an exception.
After the 20.2 release, we instead raise an exception.

@OddBloke OddBloke left a comment
Collaborator

The actual functionality change itself LGTM, with a couple of minor comments. Thanks!

Comment on lines 75 to 78
if source_exception:
    raise Exception(error_message) from source_exception
else:
    raise Exception(error_message)
Collaborator

I just tested on xenial's Python and raise <exception> from None appears to behave the same as raise <exception>:

Python 3.5.2 (default, Apr 16 2020, 17:47:17) 
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> raise Exception()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
Exception
>>> raise Exception() from None
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
Exception

so we can potentially simplify this to:

Suggested change
if source_exception:
    raise Exception(error_message) from source_exception
else:
    raise Exception(error_message)
raise Exception(error_message) from source_exception

@TheRealFalcon TheRealFalcon changed the title WIP include fails silently Include fails silently and new feature flag functionality Jun 3, 2020
@TheRealFalcon
Member Author

Addressed comments, and altered the way of importing feature_overrides due to the difference in errors that can be raised depending on Python version.

Comment on lines 64 to 67
try:
    import cloudinit.features  # noqa
except Exception as e:
    assert type(e) in (NameError, ImportError)
Collaborator

I would expect this test to be checking that an invalid override doesn't cause these exceptions to bubble out of an import of cloudinit.features (so that the system can continue to function in the presence of broken overrides), not that they do bubble out.

(Also we'll want to test and handle SyntaxError too: I think {'SPAM: int': 'EGGS'} would do it.)
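
A sketch of the handling this implies inside cloudinit/features.py (an assumption about the shape of the fix, not necessarily the final diff):

try:
    from cloudinit.feature_overrides import *  # noqa
except (ImportError, NameError, SyntaxError):
    # A missing or broken overrides module should not break cloud-init;
    # fall back to the defaults defined above.
    pass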

@TheRealFalcon
Member Author

TheRealFalcon commented Jun 3, 2020

@OddBloke Reverted the exception stuff we talked about and added a conversation item to tomorrow's agenda about dealing with exceptions.

@OddBloke OddBloke left a comment
Collaborator

Thank you!

Build time feature flags are now defined in cloudinit/features.py.
Feature flags can be added to toggle configuration options or
deprecated features. Feature flag overrides can be placed in
cloudinit/feature_overrides.py. Further documentation can be found in
HACKING.rst.

Additionally, updated default behavior to exit with an exception
if #include can't retrieve resources as expected. This behavior
can be toggled with a feature flag.

LP: #1734939
@TheRealFalcon
Member Author

@OddBloke Rebased, squashed, and added a commit message. Feel free to modify the commit message if you think I missed any important details.
