Skip to content
This repository has been archived by the owner on Nov 15, 2023. It is now read-only.

PVF: add landlock sandboxing #7303

Merged
merged 19 commits into from
Jul 5, 2023
Merged

PVF: add landlock sandboxing #7303

merged 19 commits into from
Jul 5, 2023

Conversation

mrcnski
Copy link
Contributor

@mrcnski mrcnski commented May 28, 2023

Pull Request

Overview

Landlock is a new sandboxing mechanism in Linux. Unfortunately it only sandboxes filesystem access right now. Also, it's 5.13+ so not all kernel versions support it. However, if support is not enabled calling landlock is simply a noop. It was easy to add and nice to have in the interim until full sandboxing.

TODO

  • Implement check_enabled
  • Enable telemetry (to see how many validators have landlock enabled).
    • The telemetry data would ideally be private so that attackers can't exploit this data. I don't think we currently support private telemetry and it shouldn't really be a blocker for this PR.
  • Add section to validator's guide about upgrading the kernel with landlock support.
    • Decided not to do this because the validator guide recommends using a VPS, so it wouldn't be very easy to upgrade: "The most common way for a beginner to run a validator is on a cloud server running Linux. You may choose whatever VPS provider that your prefer."
    • [Jun 6 2023] On further research, most distros should have landlock enabled.
    • [Jun 6 2023] Added note to the guide here.

Related

Closes #7243

@mrcnski mrcnski added A0-please_review Pull request needs code review. B0-silent Changes should not be mentioned in any release notes C1-low PR touches the given topic and has a low impact on builders. D3-trivial 🧸 PR contains trivial changes in a runtime directory that do not require an audit. T4-parachains_engineering This PR/Issue is related to Parachains performance, stability, maintenance. labels May 28, 2023
@mrcnski mrcnski self-assigned this May 28, 2023
@mrcnski mrcnski marked this pull request as draft May 28, 2023 17:44
@mrcnski mrcnski marked this pull request as ready for review May 31, 2023 14:26
@mrcnski
Copy link
Contributor Author

mrcnski commented May 31, 2023

I've either implemented or crossed-out all TODO items. Should be ready for review.

@mrcnski mrcnski requested a review from eskimor May 31, 2023 14:35
Copy link
Member

@eskimor eskimor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! Thanks @mrcnski !

cond_notify_on_done(
|| {
#[cfg(target_os = "linux")]
let _ = crate::worker::security::landlock::try_restrict_thread();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, instead of simply really doing nothing, I think we should issue a warning that landlock is not available and for maximum security encourage the operator to upgrade their Kernel/make sure landlock is available.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same for non Linux - a warning that the landlock sandboxing is not available would be a good idea, I think.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can emit a warning (once on host startup I think), but the way the validator guide is written right now, I'm not sure how much control a validator can have on their kernel version/configuration:

"The most common way for a beginner to run a validator is on a cloud server running Linux. You may choose whatever VPS provider that your prefer."

I guess we could recommend "try to find a VPS with landlock enabled", but unless we give specific recommendations, it might be hard to find this info.

For people not running on a VPS, I found a guide for enabling landlock, but I haven't tested it, and I'm not a Linux expert. I'm not sure we should be officially recommending random guides from the internet...

Advice would be really appreciated. Let me know what you think!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can emit a warning (once on host startup I think)

Hmm, so given the above, I'm not sure how actionable a warning would be. It could just be noise for most operators.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found out that landlock is enabled by default on most distros including Ubuntu 22.04 LTS (what we recommend). And I don't see any reason why a VPS would explicitly opt-out of a security feature. So, we can simply emit a warning like this:

WARNING: Could not enable landlock, a Linux kernel security feature. Running validation of PVF code may compromise this machine. Consider upgrading the kernel version for maximum security.

This comment was marked as duplicate.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, if we consider this the recommended way of running the validator then we should refuse to start if the feature is not present and build in an explicit cmdline argument to bypass this check.

That's a good idea, I've added it to #7073 (see description). I don't think we can mandate it right now because like half of validators are on too-old kernel versions (according to the visible telemetry data). We already need to announce that "secure mode" change in advance, maybe we can also mention that validators should upgrade to kernel 5.13+ to avoid running in "insecure mode". I will bug Will again to see if we can get that announcement going.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright, makes sense then. So, keep the warning for now and make it mandatory at a later date.

Copy link
Member

@eskimor eskimor Jun 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed this makes sense, I would nevertheless suggest a soft launch:

  1. Enable the feature and print warning on startup if not available, hinting that this feature will be mandatory on some future release XY and that the operator should upgrade the machine. In addition this should obviously be part of the release notes as they are less likely to be ignored.
  2. Release version XY which makes it mandatory, exiting with a warning that the check can be bypassed by that command line flag.

Updating the wiki would be good as well, but is least likely to be read by already operating operators.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@eskimor I missed your comment before merging. Here's what we have now, let me know what needs a follow-up:

node/core/pvf/src/host.rs Outdated Show resolved Hide resolved
node/core/pvf/common/src/worker/mod.rs Outdated Show resolved Hide resolved
@mrcnski
Copy link
Contributor Author

mrcnski commented Jun 1, 2023

I've been pondering whether we should enable ABI V2 instead of V1, even though
only the latter is supported by our reference kernel version. If we were on V2,
then landlock would use that if it was supported on the current machine, giving it
stronger security, and just fall back to V1 if not supported.1

The issue I see with that is indeterminacy. If half of validators were on V2 and
half were on V1, they may have different semantics on some PVFs. So a malicious
PVF now has a new attack vector: they can exploit this indeterminism between
landlock ABIs! But this is exactly the kind of thing we want to prevent!

So, we have to stick only to the latest ABI supported by our reference kernel
version (right now ABI V1). If a validator's machine does not fully support it,
we can't let them run as a secure validator.

The only caveat with that is that we want to make running securely actually
realistic for users, so we don't have a significant proportion of validators
running as insecure-i-know-what-i-do. If there were enough of those, it would
again provide some indeterminism that could be exploited - e.g. if 33% weren't on
a kernel that supported landlock, and found it easier to just pass the insecure
flag than upgrade.

So, takeaways:

  1. Always use a reasonable ABI that most validators can fully support, and require full
    landlock enablement to run securely.
  2. Too many validators on insecure-mode can be a source of indeterminism.
  3. We should monitor with telemetry how many validators are secure vs.
    insecure
    . This shouldn't use the public telemetry server.

I've documented the determinism concerns in a new commit.

Also, I think this PR should be burned-in on Versi before merging.

Footnotes

  1. The current versions are: reference kernel: 5.16+ | V1: 5.13 | V2: 5.19

@mrcnski
Copy link
Contributor Author

mrcnski commented Jun 1, 2023

Could use a re-review to sanity check the last two commits. I'll do a burn-in meanwhile.

@mrcnski mrcnski requested a review from eskimor June 1, 2023 22:25
/// we were on V2, then landlock would use V2 if it was supported on the current machine, and
/// just fall back to V1 if not.
///
/// The issue with this is indeterminacy. If half of validators were on V2 and half were on V1,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This logic doesn't parse well for me, if V1 is less restrictive and makes validators vulnerable then I would guess having just a part of them vulnerable is better than all of them isn't it ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question and I wondered the same. The issue is that it opens up a possibility of different behavior on different validators, and this itself can be exploited to attack consensus. Say that in the future we enable ABI V3 and executing some PVF tries to truncate a file (which will be banned on ABI V3)1 - some validators will error and some won't. If the split in voting among validators is less than 33%, there will be a dispute and all the losing validators will get slashed. If the split is more than 33%, it violates our assumptions about consensus and finality will stall.

So indeed there is a very interesting trade-off between security for the validator and security for the chain, and I think we have to prioritize the latter while providing as much validator security as possible. If a small amount of validators are behind in security and vote wrongly then some slashing is okay, and it can be reverted with governance, but I think we really don't want finality stalls.

Although, I guess it would also be really bad if a bunch of validator keys got stolen and an attacker started impersonating them... And anyway there are other sources of indeterminacy to attack the chain with... Fortunately ABI V1 already fully protects against reading unauthorized data, so in this case it is enough to protect validators' keys and it is still the correct decision. (The only other thing I would want to feel safe is to block all network access. Maybe it's possible to set up a firewall per-thread?)

There are similar considerations that made the seccomp sandboxing harder than anticipated. Maybe @eskimor can double-check my analysis.

Footnotes

  1. Using V3 in this example because V2 doesn't actually provide additional restrictions on top of V1.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the explanation, makes sense now! I guess V1 is the best we can do.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be the default for validators to have these security measures in place, ideally we would have none without them. Anyhow, the risk of disputes should be very low as this is already a second level defense mechanism. I would rather have a dispute than some PVF being able to read the validator's disk. We should make damn sure that there are no legitimate disk accesses of course, but checking that should be independent of PVF or candidate, so also rather easy. At least at the moment, I can't think of a legitimate disk access that would only happen on some PVFs ..

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @eskimor. Determinism is still a goal, and given that ABI v2 and v3 don't add to the security I would stick to v1 here.1 I will update the docs as the determinacy is still relevant but not the only factor. And if in the future a version is released with meaningful new features like network blocking (which is an eventual goal of landlock) we can enable it immediately. We should keep in mind that attackers can exploit any difference in operation to attack the chain, but the risk is low and there are other indeterminacy vectors anyway.

Footnotes

  1. v2 adds a new config option which we don't use, and v3 additionally blocks file truncation - which might be annoying, but not really critical to security, and validators should have backups, right?

node/core/pvf/execute-worker/src/lib.rs Show resolved Hide resolved
Copy link
Contributor

@sandreim sandreim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work @mrcnski ! Before merging we should make sure we run on CI machines which support landlock and also without such support, just to make sure things work fine in both cases.

node/core/pvf/src/host.rs Outdated Show resolved Hide resolved
node/core/pvf/src/host.rs Outdated Show resolved Hide resolved
node/core/pvf/common/src/worker/security.rs Show resolved Hide resolved
mrcnski and others added 4 commits June 2, 2023 08:06
Co-authored-by: Andrei Sandu <54316454+sandreim@users.noreply.github.com>
Co-authored-by: Andrei Sandu <54316454+sandreim@users.noreply.github.com>
@mrcnski
Copy link
Contributor Author

mrcnski commented Jun 2, 2023

I raised w3f/polkadot-wiki#4872, who do I annoy to get it reviewed? 🙃

@mrcnski
Copy link
Contributor Author

mrcnski commented Jun 5, 2023

Required before merging:

@mrcnski mrcnski added B1-note_worthy Changes should be noted in the release notes T0-node This PR/Issue is related to the topic “node”. and removed B0-silent Changes should not be mentioned in any release notes T4-parachains_engineering This PR/Issue is related to Parachains performance, stability, maintenance. labels Jun 6, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
A0-please_review Pull request needs code review. B1-note_worthy Changes should be noted in the release notes C1-low PR touches the given topic and has a low impact on builders. D3-trivial 🧸 PR contains trivial changes in a runtime directory that do not require an audit. T0-node This PR/Issue is related to the topic “node”.
Projects
No open projects
Development

Successfully merging this pull request may close these issues.

PVF worker: consider securing with landlock
4 participants