Include fails silently and new feature flag functionality #367
Conversation
I'm not familiar with the error signalling, so I'm going to investigate it myself and document my investigation here.
cloudinit/cmd/main.py, lines 652 to 666 at 09492b1: […]

So we can see that exceptions aren't the only way that errors are reported from main_init and main_modules (cloudinit/cmd/main.py, lines 864 to 865 at 09492b1).

I believe that, from the traceback you provided in IRC, we can see that the failure comes from cloudinit/cmd/main.py, lines 376 to 377 at 09492b1.

Traceback: […]
To some extent, the rest of this question doesn't really matter. People may be relying on #include processing not causing cloud-init to terminate, so we should retain that behaviour.
That said, I think there are a couple of stronger arguments to make for the current behaviour:
Does that sound reasonable?
@OddBloke: […] is the most important piece.
My $0.02.
My personal opinion has changed over time on this point. I started off as "ignore errors, log WARN, keep going". Now I'm very much on the other end. No one reads warnings. I'd argue for logging the exception wherever you can and exiting with failure. Anything else will just get ignored, and will make the original problem harder to find.

I think that @OddBloke is suggesting that we can't make such a change as it would be backwards-incompatible. I think that is arguable. But we definitely can make the change going forward, and just have to maintain the old behavior in Ubuntu releases. We have done that for other things. It's not that painful.

I honestly think failing is the right behavior. And you can do the right thing.
OK, if Scott would be comfortable making this change going forward, then that moves the needle for me significantly. I would also be comfortable making this an error condition.

One wrinkle to make sure we're agreed on: the expectation is that the user will be able to debug this failure by reading the console logs from their instance provided by their cloud, because the hard error-out likely means they don't have SSH access (I think?). So we need to ensure that this error is emitted to stdout. (Maybe all unhandled exceptions are, because that's what Python does by default, but we should double-check.)

We'll also want a feature flag to easily toggle this behaviour off in existing Ubuntu releases (by which I mean something like […]).

Does that sound right?
I like the idea, but my suspicion is that #include isn't used a huge amount (and in most cases I would expect is very unlikely to fail except via a misconfiguration), so we shouldn't invest too much extra effort into it up front. My feeling is that we should make it do the right thing, as smoser put it, and if that isn't the right thing for all of our users then (they can file a bug report and) we can address it at that point (with a configuration option).
@OddBloke, I remember you mentioning recently that you wanted to implement feature flags, so I'm assuming we don't have any yet. I pushed another commit with a possible implementation using a feature flag. Again, still WIP, but curious what you think of it.

I'm still not fully up to speed on how we do our versioning with packages and how new updates get versioned into old OSes, so maybe the way I'm checking versions wouldn't work. If this way would work, one big caveat is that once we release, we'd immediately need to push the version in the repo to something higher (i.e., my '20.2-post1' here) as a temporary "higher than released, but not actually our next version yet" so that new features can be triggered correctly as being greater than the released version.
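To make the version-check idea above concrete, here is a minimal sketch of what such a gate could look like (illustrative only: the constant names are invented, and the reply below explains why this approach won't work with Ubuntu's backporting model):

```python
# Illustrative version gate: the feature turns on only when the codebase
# version is newer than the last released version.
from pkg_resources import parse_version

# Hypothetical "higher than released, but not our next version" marker.
CODEBASE_VERSION = '20.2-post1'
RELEASED_VERSION = '20.2'

# '20.2-post1' parses as a post-release of 20.2, so it sorts higher.
ERROR_ON_USER_DATA_FAILURE = (
    parse_version(CODEBASE_VERSION) > parse_version(RELEASED_VERSION)
)
```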
So in the past we have used configuration to toggle changes in behaviour; see 1d5e9ae for an example of introducing such a configuration option. My feeling, though, is that not all behavioural changes in cloud-init should be configurable. Even introducing a boolean configuration option effectively doubles our potential support burden: to continue the above example, we could now get bug reports from xenial Azure instances which are in the default configuration (i.e. the previous behaviour) or in the non-default configuration, where the network configuration of instances can be substantially different. (We can also get the opposite on more recent releases: e.g. focal with secondary NICs disabled.) Now, to be clear, I'm not picking on this Azure change; it seems to me that it should be configurable.

So what's different with this #include change? A couple of things, I think. Firstly, and most importantly, you can easily work around this change if it's causing you problems: stop specifying a broken #include URL in your user-data. Contrast this with the Azure networking case: without the configuration option, there's no easy way to configure your secondary NICs on xenial.

Secondly, this change is to cloud-init's internals, rather than explicitly changing the behaviour of instances on which cloud-init is installed. This is a more subtle distinction, so let me try restating it: it is reasonable for us to unilaterally change how we expect users to provide configuration to us (though, of course, we should be thoughtful when doing so), but it's less reasonable for us to change the meaning of configurations (including the null configuration, i.e. our default behaviour) without a way for users to opt out. Apologies if this still isn't clear; my first point stands regardless. :-)
Yep, you are correct that it won't work. We backport upstream snapshots of cloud-init to older Ubuntu releases, so xenial will be on 20.2 once this change ends up there. I think it's reasonable to hard-code this to True here, and we'll introduce a quilt patch for all current Ubuntu releases which will do s/True/False/. What do you think?
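For illustration, the per-release quilt patch being proposed could be as small as a one-line flip of the default (the file and flag names follow the rest of this thread; the patch itself is hypothetical):

```diff
--- a/cloudinit/features.py
+++ b/cloudinit/features.py
@@
-ERROR_ON_USER_DATA_FAILURE = True
+ERROR_ON_USER_DATA_FAILURE = False
```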
I think I get it. We don't want to break user configuration normally, so we should be wary of making changes like this if it refers to user configuration. But #include is more of an internal change and one more easily worked around, so we can make it here. Let me know if I'm missing the point 😄
I don't know, tbh. We already have an infrastructure in place for applying these quilt patches that seems to work, but as an outside person coming in, it feels fairly complex. There might be aspects of versioning that justify that complexity that I'm not aware of yet, but if I were approaching the problem from scratch, I would put the feature change logic into code, rather than patches. I.e., use something from the environment or one or two code fields in a separate file that would change per branch or release that uniquely identify the environment we're in, and then make all of the changes based on those fields, so the only thing that would be different between release/branch is that unique identifier. Does that make sense?

Given that we already have this process for applying merges, I suppose it makes the most sense to just stick with that. But in that case, does it make sense to add a separate `features.py` file? If we're going to have a quilt patch, should I remove this `features.py` and move `ERROR_ON_USER_DATA_FAILURE` into `user_data.py`?
On Fri, May 15, 2020 at 08:54:10AM -0700, James Falcon wrote:
> > Apologies if this still isn't clear; my first point stands
> > regardless. :-)
> I think I get it. We don't want to break user configuration normally,
> so we should be wary of making changes like this if it refers to user
> configuration. But #include is more of an internal change and one more
> easily worked around, so we can make it here. Let me know if I'm
> missing the point :smile:
Yep, I think you get it too. ;)
> > Yep, you are correct that it won't work. We backport upstream
> > snapshots of cloud-init to older Ubuntu releases, so xenial will be
> > on 20.2 once this change ends up there. I think it's reasonable to
> > hard-code this to True here, and we'll introduce a quilt patch for
> > all current Ubuntu releases which will do s/True/False/. What do you
> > think?
> I don't know, tbh. We already have an infrastructure in place for
> applying these quilt patches that seems to work, but as an outside
> person coming in, it feels fairly complex. There might be aspects of
> versioning that justify that complexity that I'm not aware of yet, but
> if I were approaching the problem from scratch, I would put the
> feature change logic into code, rather than patches. I.e., use
> something from the environment or one or two code fields in a separate
> file that would change per branch or release that uniquely identify
> the environment we're in, and then make all of the changes based on
> those fields, so the only thing that would be different between
> release/branch is that unique identifier. Does that make sense?
I think that makes a lot of sense (in that I believe I understand your
proposal; I'm not sure if it's the right path forward (yet :)).
I'll definitely need to noodle on this some more, but let me jot down
some initial thoughts (and open questions, which I am not asking for you
to answer, to be clear):
* This would reduce the need for RELEASE_BLOCKER comments and manual
intervention at backport time.
* This would presumably push some Ubuntu-specific logic up into the
upstream codebase and out of our Ubuntu packaging.
* Relatedly, how would other downstreams (e.g. Debian, SUSE) handle
manipulating the feature flags if they want to retain previous
behaviour for an existing release? (It's certainly not practical to
capture _every_ downstream's feature flag logic into the upstream
codebase.)
* This proposal would move (some part of) the determination logic to
happen at runtime instead of package build time (whereas in my quilt
patch proposal, it's determined once and hardcoded, so the runtime
check is just `if <bool>`). There will be some (I assume minor)
performance cost, and some (definitely minor) increase in codebase
size.
* This would reduce the codebase delta between Ubuntu releases running
the same upstream snapshot.
* How do we test that our feature flag logic is correct?
> Given that we already have this process for applying merges, I suppose
> it makes the most sense to just stick with that. But in that case,
> does it make sense to add a separate `features.py` file? If we're
> going to have a quilt patch, should I remove this `features.py` and
> move `ERROR_ON_USER_DATA_FAILURE` into `user_data.py`?
Regardless of the resolution to the above conversation, I still think a
separate `features.py` makes sense: it gives us a single place to
examine to find behavioural differences between releases, rather than
having to hunt through quilt patches. The ideal ending point for this
stream of work, I think, is our per-release quilt patches consisting
only of (a) modifications to default configuration (for things we would
like users to be able to modify), and (b) `features.py` for everything
else. (In your proposal above, I'm assuming the per-release code fields
are at the top of `features.py`; if not, then
s/`features.py`/`that_file_instead.py`/ in this ideal vision.)
Some comments on a few of these...

> * This would presumably push some Ubuntu-specific logic up into the upstream codebase and out of our Ubuntu packaging.

I'm assuming that's a good thing?

> * Relatedly, how would other downstreams (e.g. Debian, SUSE) handle manipulating the feature flags if they want to retain previous behaviour for an existing release?

Curious the background on this one. Do other downstreams apply their own patches when they package that we have no knowledge of? If we know about them, aren't we already capturing their feature flags, just in a possibly less efficient manner? If we don't know about them, why is it our concern?

> * How do we test that our feature flag logic is correct?

Shouldn't be too hard to mock patch the flags... either directly or indirectly by instead patching the release version. That also makes it easy to test expected behavior on an old release version without having to apply patches first. Do we currently have a way of testing that the patched behavior is correct?

+1
On Fri, May 15, 2020 at 10:50:46AM -0700, James Falcon wrote:
> > * This would presumably push some Ubuntu-specific logic up into the
> > upstream codebase and out of our Ubuntu packaging.
> I'm assuming that's a good thing?
Perhaps, but not necessarily. We try to maintain a separation between
ourselves as an upstream and as the packagers for Ubuntu downstream.
cloud-init is not only used on Ubuntu, and we want to ensure that we are
maintaining the project in a way that works for all those downstream
distros, not only ourselves. In that vein, Ubuntu isn't treated as
anything other than another distro in the cloud-init codebase: there are
no non-comment/doc/test uses of any Ubuntu codename (precise onward)
outside of `cloudinit/distros/ubuntu.py`. So introducing feature flag
logic that affects behaviour based on the running Ubuntu releases
outside of `ubuntu.py` (and, more strongly, outside of the `Distro`
subclass we determine to be the right one to use at runtime) is a
definite change to the way that distro-specific behaviour is captured
in the codebase currently. We need to weigh that very carefully.
This, perhaps, indicates that the approach to feature-flagging should be
handled through the `Distro` hierarchy instead of as a separate module?
That would make this patch, at least, substantially more complicated.
> > * Relatedly, how would other downstreams (e.g. Debian, SUSE) handle
> > manipulating the feature flags if they want to retain previous
> > behaviour for an existing release? (It's certainly not practical to
> > capture _every_ downstream's feature flag logic into the upstream
> > codebase.)
> Curious the background on this one. Do other downstreams apply their
> own patches when they package that we have no knowledge of?
Other downstreams certainly do apply patches. SUSE has a set of patches
that robjo is gradually upstreaming, for example. I don't know about
other downstreams, but it's absolutely not unusual for downstreams to
patch software when integrating it into their distribution (and it's
arguably more likely with cloud-init, because we are bound up with
boot/networking/user creation/etc. and are therefore more heavily
integrated than most).
(I don't know any details, but you can browse through the Debian
changelog and see that they carry a number of patches:
https://metadata.ftp-master.debian.org/changelogs//main/c/cloud-init/cloud-init_20.1-2_changelog)
> If we know about them, aren't we already capturing their feature
> flags, just in a possibly less efficient manner?
I don't fully follow this question. "We" aren't capturing their
differing requirements at all. Rather, they are capturing them in their
packaging, wherever that is stored (and however those patches are
maintained, if they are even maintained as patches to begin with). (And
we don't know who all of "they" are; anyone could be pulling our tarball
and integrating it into their distro.)
Even in cases where we do know about downstream patches, they aren't
necessarily easy to integrate upstream. For example, they may be
hard-coding the correct behaviour for $distro in a part of the code that
doesn't have ready access to the `Distro` instance. That's fine for
their codebase, because it will only ever run on $distro, but could
require a substantial refactor to integrate it upstream (because it
would need to be made conditional).
> If we don't know about them, why is it our concern?
Because we want cloud-init to be adopted as widely as possible, and kept
as up-to-date as possible where it is adopted. The harder we make it
for downstream maintainers to package new versions, the less likely we
make it that cloud-init will be recent in all places. This is bad for
our users, because it means that they have to deal with more, and older,
different versions of cloud-init if they want to use it on different
distros. And it's bad for us, and for the cloud providers, for exactly
the same reason.
> > * How do we test that our feature flag logic is correct?
> Shouldn't be too hard to mock patch the flags... either directly or
> indirectly by instead patching the release version. That also makes it
> easy to test expected behavior on an old release version without
> having to apply patches first. Do we currently have a way of testing
> that the patched behavior is correct?
We generally manually verify the behaviour during the SRU process in
which the delta is introduced, but we don't have any automated testing
of it specifically. This is a definite gap.
Thanks. I wasn't aware of the downstream impact or the fact that we want to keep our release info out of the core logic.

One more possibility that I have been considering is having an initial features file with all of the default values hardcoded (like you had initially suggested), but then also having a specific extension point where branches/releases/downstreams can provide specific overrides. It could be as crude as something like this:

```python
ERROR_ON_USER_DATA_FAILURE = True
# Other flags...

try:
    from feature_overrides import *
except ImportError:
    pass
```

By default the file wouldn't even exist on master. But branches/downstreams could add the file to override defaults. To me, something like this could offer a number of benefits: […]
On Mon, May 18, 2020 at 07:19:18AM -0700, James Falcon wrote:
> Thanks. I wasn't aware of the downstream impact or the fact that we
> want to keep our release info out of the core logic.
> One more possibility that I have been considering is having an initial
> features file with all of the default values hardcoded (like you had
> initially suggested), but then also having a specific extension point
> where branches/releases/downstreams can provide specific overrides. It
> could be as crude as something like this:
>
> ```
> ERROR_ON_USER_DATA_FAILURE = True
> try:
>     from feature_overrides import *
> except ImportError:
>     pass
> ```
>
> By default the file wouldn't even exist on master. But
> branches/downstreams could add the file to override defaults.
I like this proposal a lot. I think it retains simplicity in the core
codebase, but leaves the door open for easy extension/modification by
distributors of cloud-init without prescribing any particular process on
them. Good job!
> To me, something like this could offer a number of benefits:
>
> <snip>
>
> 3. It could limit our testing burden. We could promise to test only
>    the upstream defaults (and any other configuration we care about).
>    If somebody downstream wants to change the configuration, it's
>    still their burden to test it.
I think this is the most important open question I have: do we only test
with the upstream default? Specifically, what does that mean for
downstreams who are "reverting" to previous behaviour? (Given that we
will be one such downstream, we have a vested interest in getting this
right. ;) I would like to be able to limit our test matrix, so let's
dig into the details.
In Ubuntu, both generally and specifically for cloud-init, we run the
unit tests shipped by upstreams during package builds, to catch
inconsistencies between the upstream test environment and an actual
installed Ubuntu environment. (And this has definitely saved us from
shipping broken cloud-init packages on older releases in the past.)
To take this specific example, if we are only testing the upstream
default value, then the commit which introduces the
`ERROR_ON_USER_DATA_FAILURE` flag (defaulting to True) will also modify
all the tests in the codebase to pass when `ERROR_ON_USER_DATA_FAILURE =
True`. If these tests are any good, then they should fail when the
behaviour of the unit under test changes (as it would when we set
`ERROR_ON_USER_DATA_FAILURE = False` in our packaging branches).
So if, as a downstream, we naively drop `ERROR_ON_USER_DATA_FAILURE =
False` into `feature_overrides.py`, then our test run will start
failing. We could work around this class of failures by (in the
upstream codebase) ensuring that any tests that rely on
`ERROR_ON_USER_DATA_FAILURE = True` are skipped if that precondition
isn't true. This improves the downstream situation somewhat: the test
run doesn't fail. However, it doesn't solve the problem entirely:
downstream has lost test coverage of the `ERROR_ON_USER_DATA_FAILURE =
False` codepaths (because upstream modified its testing to only cover
the new, `= True` behaviour, and we're skipping all those tests anyway).
From a purely downstream perspective, the simplest solution to all of
this: instead of using `feature_overrides.py`, introduce a patch which
reverts the commit that introduced the change in behaviour. This will
not only revert the behaviour (which `feature_overrides.py` could have
done), but will also revert the tests to testing the previous behaviour.
However, from a downstream perspective, managing that _simple_ solution
would be far from _easy_, so we want to avoid it. And, from an upstream
perspective, we want to avoid it being (or, more accurately, appearing
to be) the path of least resistance too, because we know that's going to
end up forcing an impractical maintenance burden on downstreams.
So this is all a long-winded way of saying that I _think_ we need to
test all branches that a feature flag will end up leading us down, so
that downstreams aren't all left with that burden (or under-tested
changes).
In some senses, this is the flip-side of the "we're both upstream and
downstream" coin: it might be strictly acceptable for downstreams to be
left responsible for testing the non-upstream defaults, but we _are_ one
of those downstreams, so we're going to be on the hook for handling that
testing one way or another. So then the question is: where is it most
efficient for us to handle it, and is that place Wrong (TM) for some
reason? And I think that testing both branches of a feature flag would
be more efficient in the upstream codebase, and as these are generic
feature flags (rather than being specifically fitted to Ubuntu's needs),
I don't think there is anything dissonant about those tests sitting in
the upstream codebase.
A side note, another open question: do we need a defined
lifetime/deprecation policy for feature flags? From an Ubuntu POV, we
won't need ERROR_ON_USER_DATA_FAILURE once we are no longer backporting
full upstream releases to focal (which could be as *ahem* early as
2025). If there are no other downstreams requiring the flag at that
point, we could remove it (and the `= False` codepaths) from
cloud-init's codebase.
I really like the idea of a single features.py module which captures hard-coded T/F defaults for features that we know are toggled differently on different downstreams (given that we know that xenial may not adopt FEATURE_X because it may introduce unexpected behavioral changes). I also like the optional feature_overrides module you suggested, specifically because of your suggestion about patch collision avoidance.

The gap I feel like we still aren't tracking, though, is that we don't know in master that ubuntu/xenial has a feature overridden. Could we express this in tip so we don't have to provide an override module in the ubuntu/* downstream branches? What do folks think about adding a feature_overrides module something like the following, which downstreams could drop or modify as they see fit?

```python
sys_info = util.system_info()
if sys_info['variant'] == 'ubuntu' and sys_info['dist'][2] in ('xenial', 'bionic'):
    ERROR_ON_USER_DATA_FAILURE = False
```
I like that a lot.
I wonder whether we could at least provide an override in master for the known feature disabling we are going to do, separate from applying patches to ubuntu/xenial|bionic|focal etc. as part of our SRU process. Those work items are more likely to be forgotten without a process around RELEASE_BLOCKER or SRU_BLOCKER comments in the code.
Since our ubuntu/* branches (and SRUs to xenial, bionic etc.) are the downstream, we're still on the hook there.
While we have a feature flag settable in our code, I'd vote for upstream's 'burden' as: […]
The reason I suggest the mock of that feature value is so unit tests don't have to get exposed to whether the code happens to introduce a feature_override, since we know this is an expected behavior in downstreams (specifically ours, but broadly other downstreams).
+1
@OddBloke, I think dropping a feature flag entirely makes sense when all our Ubuntu downstream support of that disabled flag ends. I'm presuming that other downstreams will continue to patch out behavior that they find irrelevant, if we have a feature boundary. Downstreams can revert that change if they want to continue using that feature.
(This comment was supposed to go up right when I pushed my last commit, but for some reason it never went through. @blackboxsw, I typed this up before you posted your most recent comments.)

I pushed a commit with (most) of the changes needed for this particular feature. A summary of what's being proposed (mostly copied from an earlier comment):

```python
ERROR_ON_USER_DATA_FAILURE = True
# Other flags...

try:
    from feature_overrides import *
except ImportError:
    pass
```

By default the file wouldn't even exist on master. But branches/downstreams could add the file to override defaults. Benefits: […]

We would be responsible for all possible flag paths.

Quick comment on terminology after a quick googling... a quote from Martin Fowler: […]
@blackboxsw Can you explain your thinking more here? I'm not sure I understand. Why should master care about what xenial is doing? Shouldn't our own downstreams function the same as others (non-Ubuntu)? Why do you think it would be simpler to express this in master rather than committing an override once in a downstream branch?
There are a few meager benefits I can think of (but not a strong argument for):
Again, not a strong opinion, and I can be easily swayed. The more I think about it, we can probably still decouple the feature-disabling work from our SRU release process by adding the downstream feature_override values regardless of whether that specific feature flag support is actually read by the ubuntu/* release branch, because it's only a variable that happens to be unreferenced until the next […]
Thanks for this interesting conversation; there's a lot of detail on how things work today in there. It sounds like things are heading a little more feature-flag-like, in that there is an "is this on or off" type setup.

I challenge the "support burden" of having the flags in the upstream codebase, as the burden is there regardless. In the case of it applied downstream there's the discussion of the test requirements of that. Ideally, cloud-init would be able to test its own expected behavior whether the feature is on or off, and be able to do it in a more "unit-test" style in that you can get close to the code and validate the flag does what it says it does. However, if downstream releases were flipping that flag and wanting assurances at a broader functional test level, that would be the downstream burden to validate and be sure of. For instance, in the case of Ubuntu series behaving differently, we'd have purer upstream test cases that verify cloud-init behaves, and broader series tests that verify they continue to work in their expected feature configuration.

My main concern with the quilt patches is that if you're chasing a problem in code, you can't validate what code is actually being run/seen until you evaluate not just what's in the upstream tree, but also in these patches. Which patches apply? In what order? It makes it much more difficult in my mind to reason about what the person with an issue is struggling with. It also means it's near impossible to perform any sort of testing without going through the build process as part of that. That means that the iteration speed of the work is significantly impacted. Obviously we're not going to change the world, but I'm extremely hesitant to build upon.

Personally, I'd rather see flags like the off-on-the-side file that is read/processed such that downstream can provide a single "overrides" or "flags" file that flips things. Rather than a list of patches, anything they'd have would be clearly in a specific file, and you can trace the history of that file over time if needed to aid in finding older issues, vs tracking the existence of patch files that came and went across time.
On Wed, May 20, 2020 at 05:01:44AM -0700, Rick Harding wrote:
> I challenge the "support burden" of having the flags in the upstream
> codebase as the burden is there regardless.
I don't believe that anyone made the argument that the flags should not
be in the upstream codebase, and I certainly didn't in the quoted
comment.
You're quoting me saying "introducing a boolean configuration option
effectively doubles our potential support burden", but I am specifically
referring to _configuration options_ there, not to feature flags
(as-proposed here). I'm saying that if we allow this to be configured
at runtime, then we have to support both paths on all Ubuntu releases
(on/off on xenial/bionic/eoan/focal/groovy; 10 different configurations
right now, and 2 more for every new Ubuntu release). If it's configured
at package build time, with a feature flag, then we only need to support
one behaviour on each release (off on xenial/bionic/focal/eoan, on on
groovy+; 5 configurations right now, and 1 more for each new release).
> My main concern with the quilt patches is that if you're chasing the
> problem in code you can't validate what code is actually being
> run/seen until you evaluate not just what's in the upstream tree, but
> also in these patches.
I just want to take a moment to ensure that we level set: quilt patches
are a standard way for Debian packages to apply changes to the tarball
distributed by upstreams; we are using them in a way that is totally
appropriate and normal in packaging. [0] is a blog post entitled "How
to use quilt to manage patches in Debian packages", which I would
recommend reading. (I'll also note that when I say this is "normal in
packaging": the first sentence of that post is "Most Debian source
packages are using the “3.0 (quilt)” format" and the author has been a
Debian Developer since last millennium.)
[0] https://raphaelhertzog.com/2012/08/08/how-to-use-quilt-to-manage-patches-in-debian-packages/
I'm not going to say that the quilt packaging process is smooth or
anything, but I want to make sure it's understood that this is not an
NIH process on our team. (And if you don't like the look of our quilt
patches, wait until you see the quilt patches in the other squad's
packages. ;)
> Which patches apply? In what order? It makes it much more difficult in
> my mind to reason about what the person with an issue is struggling
> with.
cloud-init is a "3.0 (quilt)" Debian package[1]. As with all "3.0
(quilt)" packages, the quilt series file is at debian/patches/series[2],
and as with all quilt series files, the patches will be applied in the
order listed there.
[1] https://github.com/canonical/cloud-init/blob/ubuntu/xenial/debian/source/format
[2] https://github.com/canonical/cloud-init/blob/ubuntu/xenial/debian/patches/series
> It also means it's near impossible to perform any sort of testing
> without going through the build process as part of that.
Firstly, you don't need to go through the build process to reproduce the
tree it will build from: `git checkout ubuntu/xenial; quilt push -a`
gives you that. Secondly, we've established that we are going to test
both values that a feature flag can take in the upstream codebase. So
"near impossible to perform any sort of testing" is hyperbolic.
It's also somewhat beside the point: the package build process takes
~45s on my machine, and that includes running the unit tests and
lintian. If you're iterating quickly, you don't need to do either of
those; then the build process takes ~12s on my machine. For reference,
running `tox -e py3` takes ~a minute on my system (even without a
recreate). So even if you did need to run through the build process for
any and all testing, we are not talking a major time cost.
(Let me be clear: I think the scripting situation around building and
testing packages in dev is not where I would like it to be. We have two
different helpers (tools/bddeb in cloud-init and build-package in
uss-tableflip), and it's not always clear to me which of those is
appropriate to use in which situation. I _absolutely_ want us to clear
up the tooling situation, but we need to be careful to distinguish
between problems in our tooling around the process, and problems in the
process itself.)
> That means that the iteration speed of the work is significantly
> impacted. Obviously we're not going to change the world, but I'm
> extremely hesitant to build upon.
As I note above, it's not obvious to me that there's a significant
impact once people get up-to-speed on the process and tooling (which, as
I also note above, is _not_ easy to do at the moment).
> Personally, I'd rather see flags like the off-on-the-side file that is
> read/processed such that downstream can provide a single "overrides"
> or "flags" file that flips things.
Yep, I think we're all agreed on this.
> Rather than a list of patches anything they'd have would be clearly in
> a specific file and you can trace the history of that file over time
> if needed to aid in finding older issues vs tracking the existence of
> patch files that came and went across time.
So there is a problem with introducing these as files in the git tree.
Some background: currently, we produce orig tarballs that are named like
this:
cloud-init_20.1-50-g7f9f33db.orig.tar.gz
which is the string before the Debian information in the version in
debian/changelog (which, on xenial for example, is
"20.1-50-g7f9f33db-0ubuntu1~16.04.1").
Note that there is no release-specific information in that tarball name.
All cloud-init packages across Ubuntu series share the same orig
tarball, and any series-specific changes are represented in the debian
tarball, which in this case is named:
cloud-init_20.1-50-g7f9f33db-0ubuntu1~16.04.1.debian.tar.xz
Note that this _does_ contain series-specific information. This is,
broadly speaking, how (non-native) packages in Debian work: the code
provided by upstream is kept in its own tarball (ideally the exact
tarball shipped by upstream), and any Debian/Ubuntu-specific
modifications are scoped to the debian/ directory which is stored in the
debian tarball.
OK, background over, here's the concrete problem: if we introduce
changes outside of the debian/ directory in each series, then all SRU
uploads but the first one will fail, because Launchpad will (correctly)
detect that we're attempting to upload a different orig tarball under
the same name. We _are_ a downstream in Ubuntu, so we shouldn't be
modifying the code provided by upstream.
This is also important from an Ubuntu developer/(power-)user
perspective: if you pull the package from the archive (via `apt-get
source`, `pull-lp-source`, or `git ubuntu clone`, for example) then you
get all of the information about what is specific to this series in the
place you expect: the debian/ directory. If we've somehow munged some
of the series-specific information into the "upstream" code, then you
really need to know about this special package and its special
configuration, and probably pull the upstream packaging git branch
(_not_ the git-ubuntu git branch) to have any clue what's going on.
(Speaking as someone who regularly uses the above methods when debugging
other packages in the archive, I would be Very Annoyed (TM) if I
discovered that a package was doing something to break this standard
workflow.)
So all of this is why I proposed that each series carry a quilt patch
which does _all_ of the feature flag modification: it gives us a
specific file to look at, and doesn't require a major (and, IMO,
probably incorrect) reshuffling of all of our packaging.
James improved upon my proposal, of course, so the current proposal
would be: each series will carry a quilt patch which introduces
`feature_overrides.py`.
Cheers,
Dan
Thanks, I had some confusion that the patches were of the more code variety vs. the flags on vs. off. It seems that's not the idea at all, and the patch is more about setting what the true/false settings are for flags that are achievable in the primary codebase. In looking at the history of patches, that makes sense as well. The only one that gives me the :/ is the one around ubuntu-advantage-revert-tip.patch, but I assume there's a story there. Thanks for the clarification and a good amount of education on the process and specifics.
Force-pushed from bb6907a to 583bc19.
@OddBloke Can you take another look at my last two commits?
Thanks for the updated documentation here explaining the intent behind feature flag development. I'm good with your approach of separating feature_overrides.py into the specific downstream branches via quilt patches. It's easy enough for us to track them in the release-specific downstreams.

Thanks @OddBloke and @TheRealFalcon for all the discussion here on this.
cloudinit/tests/test_features.py (Outdated)

```
@@ -0,0 +1,57 @@
"""
This file is for testing the feature flag functionality itself,
NOT for testing that any individual feature flag
```
s/that any/any/
cloudinit/tests/test_features.py (Outdated)

```python
# we call this fixture, cloudinit.feature_overrides doesn't exist yet, so
# we need to reload cloudinit.features first before we can reload
# cloudinit.feature_overrides, hence the three calls here.
reload(cloudinit.features)
```
Wow, thanks for this. I've never played with importlib directly. Great contextual comments.
I've just reviewed the feature flag logic thus far, not the actual use of the feature flag being introduced.
cloudinit/features.py (Outdated)

```
@@ -0,0 +1,23 @@
# This file is part of cloud-init. See LICENSE file for license information.
# pylint: disable=wildcard-import
```
A nit (because it probably won't ever matter here), but we could move this closer to line 21 (or onto line 21, potentially) to avoid disabling this warning for the entire file.
HACKING.rst (Outdated)

```
Feature Flags
-------------

Feature flags are used as a way to easily toggle configuration
```
Instead of ~replicating the content here in two places, I would suggest using something like:

```
.. automodule:: cloudinit.features
```

which will include the docstring from cloudinit.features inline. Then we can just maintain this information in one place (and it's easily accessible via both the documentation and in the appropriate place in the source code).
cloudinit/features.py (Outdated)

```
the flag and intended lifetime
"""

# If there is a failure in obtaining user data (i.e., #include or
```
If we make this a docstring, then we may be able to include it in the generated documentation relatively easily by modifying my earlier proposal to:

```
.. automodule:: cloudinit.features
   :members:
```
cloudinit/tests/test_features.py (Outdated)

```python
@pytest.mark.parametrize('create_override',
                         [{'ERROR_ON_USER_DATA_FAILURE': False}],
                         indirect=['create_override'])
def test_feature_with_override(self, create_override):
    from cloudinit.features import ERROR_ON_USER_DATA_FAILURE
    assert ERROR_ON_USER_DATA_FAILURE is False
```
Indirectly parameterising a fixture that we are requesting in this test definition seems a little strange. I wonder if this is indicating that `create_override` might be better as a context manager? This would make this test definition something like:

```python
def test_feature_with_override(self):
    with create_override({'ERROR_ON_USER_DATA_FAILURE': False}):
        from cloudinit.features import ERROR_ON_USER_DATA_FAILURE
        assert ERROR_ON_USER_DATA_FAILURE is False
```

(I think the only change required in `create_override` would be to wrap the `yield` in a `try`/`finally`, and to decorate it with `contextlib.contextmanager`.)
What do you think?
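For concreteness, a sketch of that context-manager variant of `create_override` — assuming (as the thread's fixture does) that overrides are written verbatim into `cloudinit/feature_overrides.py` and the modules are reloaded; this is not the actual implementation:

```python
import importlib
import sys
from contextlib import contextmanager
from pathlib import Path

import cloudinit.features


@contextmanager
def create_override(overrides):
    # Hypothetical location of the override module, next to features.py.
    override_file = Path(cloudinit.features.__file__).parent / 'feature_overrides.py'
    try:
        # Values are written verbatim, e.g. {'ERROR_ON_USER_DATA_FAILURE': False}
        # becomes the line "ERROR_ON_USER_DATA_FAILURE = False".
        override_file.write_text('\n'.join(
            '{} = {}'.format(name, value) for name, value in overrides.items()))
        sys.modules.pop('cloudinit.feature_overrides', None)
        importlib.reload(cloudinit.features)
        yield
    finally:
        override_file.unlink()
        sys.modules.pop('cloudinit.feature_overrides', None)
        importlib.reload(cloudinit.features)
```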
I think of fixtures as a way of achieving a common setup/teardown for tests. I agree that the syntax is a little wonky, but the indirect parametrization is really just a way of passing arguments to a fixture. To me that seems more idiomatic (in a pytest sense) than rolling my own with context managers. That said, it's not a strongly held view, so if you'd really prefer I change it, let me know.
Yeah, that's a great point about using fixtures as a common setup/teardown mechanism. Let's stick with it.

One thing I think we can do to reduce the repetition slightly is pass `indirect=True` instead of repeating the fixture name again (I just tested locally and the tests still passed).
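i.e., the parametrisation in the quoted test would become (same behaviour, slightly less repetition — `indirect=True` marks every parametrised name as indirect, and `create_override` is the only one here):

```python
@pytest.mark.parametrize('create_override',
                         [{'ERROR_ON_USER_DATA_FAILURE': False}],
                         indirect=True)
def test_feature_with_override(self, create_override):
    from cloudinit.features import ERROR_ON_USER_DATA_FAILURE
    assert ERROR_ON_USER_DATA_FAILURE is False
```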
cloudinit/tests/test_features.py (Outdated)

```python
reload(cloudinit.features)
reload(cloudinit.feature_overrides)
reload(cloudinit.features)
```
I think we can actually simplify this a little by removing the internally-cached `feature_overrides` module if present, then reloading `cloudinit.features` (which will then have to reimport our updated `feature_overrides`):

```python
sys.modules.pop('cloudinit.feature_overrides', None)
reload(cloudinit.features)
```

and we could potentially even do the same for `cloudinit.features` (but that could cause problems with other tests so I wouldn't recommend it).
Would it make more sense for me to move the functionality around the flag being introduced into a separate PR and rename this PR?
Once again, just focused on the feature flag functionality; will complete my review of the rest of the code later today.
cloudinit/features.py (Outdated)

```
"""
If there is a failure in obtaining user data (i.e., #include or
decompress fails), old behavior is to log a warning and proceed.
After the 20.2 release, and we instead raise an exception.
```
```suggestion
After the 20.2 release, we instead raise an exception.
```
The actual functionality change itself LGTM, with a couple of minor comments. Thanks!
cloudinit/user_data.py (Outdated)

```python
if source_exception:
    raise Exception(error_message) from source_exception
else:
    raise Exception(error_message)
```
I just tested on xenial's Python and `raise <exception> from None` appears to behave the same as `raise <exception>`:

```
Python 3.5.2 (default, Apr 16 2020, 17:47:17)
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> raise Exception()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
Exception
>>> raise Exception() from None
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
Exception
```

so we can potentially simplify this to:

```suggestion
raise Exception(error_message) from source_exception
```
Force-pushed from 26fcc71 to 9baa73f.
Addressed comments, and altered the way of importing feature_overrides due to the difference in errors that can be raised depending on Python version.
cloudinit/tests/test_features.py (Outdated)

```python
try:
    import cloudinit.features  # noqa
except Exception as e:
    assert type(e) in (NameError, ImportError)
```
I would expect this test to be checking that an invalid override doesn't cause these exceptions to bubble out of an import of `cloudinit.features` (so that the system can continue to function in the presence of broken overrides), not that they do bubble out.

(Also, we'll want to test and handle `SyntaxError` too: I think `{'SPAM: int': 'EGGS'}` would do it.)
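A sketch of the test shape being asked for here, reusing the thread's `create_override` fixture (the exact parametrisation and assertions are assumptions): a broken override file should leave `cloudinit.features` importable with its defaults intact.

```python
import pytest


@pytest.mark.parametrize(
    'create_override',
    [{'ERROR_ON_USER_DATA_FAILURE': 'BOGUS'},  # written verbatim -> NameError
     {'SPAM: int': 'EGGS'}],                   # invalid target -> SyntaxError
    indirect=True)
def test_broken_override_does_not_break_import(self, create_override):
    # Neither exception should bubble out of the import; the default wins.
    import cloudinit.features  # noqa: F401
    assert cloudinit.features.ERROR_ON_USER_DATA_FAILURE is True
```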
@OddBloke Reverted the exception stuff we talked about and added a conversation item to tomorrow's agenda about dealing with exceptions.
Thank you!
Build-time feature flags are now defined in cloudinit/features.py. Feature flags can be added to toggle configuration options or deprecated features. Feature flag overrides can be placed in cloudinit/feature_overrides.py. Further documentation can be found in HACKING.rst. Additionally, updated default behavior to exit with an exception if #include can't retrieve resources as expected. This behavior can be toggled with a feature flag. LP: #1734939
Force-pushed from 1f3f727 to 1b23ea4.
@OddBloke Rebased, squashed, and added a commit message. Feel free to modify the commit message if you think I missed any important details.
WIP PR is for discussion, not review.
If an `#include` line fails, we only get a warning in the stdout log, but no failure in the `result.json`. For more background, see https://bugs.launchpad.net/cloud-init/+bug/1734939

I replaced the possible log warnings with exceptions (including a similar warning for a failed decompression I saw in the same file). I now get this in the `result.json` (the URL intentionally has an extra character at the end): […]

Is this too blunt an instrument? Previously, if one piece failed, we'd log the warning and continue. Raising an exception stops any additional processing. Do we want that? I was initially thinking no, but if you have a later piece that relies on an earlier piece having run, you probably don't want the later piece being run, do you?
Is there currently any other way of signaling errors without raising exceptions?
If we do want to use exceptions, is `Exception` OK here, or would custom exceptions be useful here?
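Finally, as a sketch of the behaviour change under discussion (the helper name and logger are illustrative; the flag name and the `raise ... from` form come from the review above):

```python
import logging

from cloudinit import features

LOG = logging.getLogger(__name__)


def _handle_user_data_error(error_message, source_exception=None):
    # New default: hard-fail so the problem surfaces in result.json.
    if features.ERROR_ON_USER_DATA_FAILURE:
        raise Exception(error_message) from source_exception
    # Old behaviour, kept behind the feature flag for existing releases.
    LOG.warning(error_message)
```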