Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Auditbeat] User dataset: Numerous fixes to error handling #10942

Merged
merged 7 commits into from
Mar 11, 2019

Conversation

cwurm
Copy link
Contributor

@cwurm cwurm commented Feb 26, 2019

When testing the user dataset on Fedora 29, I noticed it's not working. Digging into it, there were a few issues, all of which this PR fixes:

  1. When calling C functions via cgo, we should not rely on the error return value to determine if it failed. In most cases, there is no guarantee that errno is going to be zero on success (see here for some detail). Instead, we should check the return value first, and if that indicates that there might have been an error (usually when it's NULL/nil) we should check the error return.
  2. getpwent(3) explicitly states If one wants to check errno after the call, it should be set to zero before the call. errno cannot be accessed directly in Go code, so we introduce a helper function setErrno for that. getspent(3) does not have the same warning, but given that at least glibc uses the same code path for both, it's reasonable to assume the same applies. So we set errno to 0 before calling both functions.
  3. errno is thread-local, and the function families setpwent/endpwent/getpwent and setspent/endspent/getspent are not thread-safe, so we introduce runtime.LockOSThread()/UnlockOSThread() to make sure all C functions are run on the same OS thread.
  4. Usually, when iterating through users getpwent() should return NULL and errno should be zero when all entries have been returned. However, there is a bug in systemd that causes it to set errno to ENOENT even when there is no error (nss-systemd's getpwent sets errno to ENOENT systemd/systemd#9585). This bug affects at least Fedora 29. Following this change ENOENT is treated as if there is no error. It's not supposed to be a valid error value for getpwent() anyway, so should not happen anytime outside this bug.
  5. The previous systemd bug caused the user dataset to return no users, even though all users had been read successfully. This PR changes to gathering errors in multierrors and returning them alongside whatever data could be read. This is in line with wanting to make the System module more resilient to non-fatal failures during data collection (better to return some things alongside an error than no things at all).
  6. The system test will now fail when there are WARN/ERROR messages in the log. This change can probably also be made for the other datasets.

Writing up this PR description I'm realizing this combines quite a few issues. I could split it into multiple PRs if anybody wants, though most would have only a few lines of changed code. Because some indentation changed it's best to check Github's No Whitespace button.

@cwurm cwurm added bug review needs_backport PR is waiting to be backported to other branches. Auditbeat SecOps labels Feb 26, 2019
@cwurm cwurm requested a review from a team as a code owner February 26, 2019 11:42
@elasticmachine
Copy link
Collaborator

Pinging @elastic/secops

}
for _, user := range users {
event := ms.userEvent(user, eventTypeState, eventActionExistingUser)
event.RootFields.Put("event.id", stateID.String())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

stateID will be an all zero uuid if uuid.NewV4() failed. What's the impact of this on the metricset and the consumers of this data?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When sending a list of all users, the event.id is what they have in common, to make it easy for consumers to correlate them. If generating it failed, it would no longer be possible to use it for that purpose - lists of users sent from the same system at different times would have the same (zero) event.id. Instead, they would have to fall back on the less user-friendly timestamp.

It's unlikely to happen, but if it did the dataset would still send all the users, and an error as well.

x-pack/auditbeat/module/system/user/users_linux.go Outdated Show resolved Hide resolved
Copy link
Member

@andrewkroh andrewkroh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there anything that needs to be put into the changelog? Or is this unreleased.

@cwurm
Copy link
Contributor Author

cwurm commented Mar 6, 2019

Is there anything that needs to be put into the changelog? Or is this unreleased.

Good point, I've added an entry.

@cwurm cwurm merged commit 8ce5440 into elastic:master Mar 11, 2019
@cwurm cwurm deleted the user_fedora29 branch March 11, 2019 16:15
@cwurm cwurm removed the needs_backport PR is waiting to be backported to other branches. label Mar 11, 2019
cwurm pushed a commit to cwurm/beats that referenced this pull request Mar 11, 2019
…0942)

Fixes a number of issues in the `user` dataset discovered while testing on Fedora 29:

* When calling C functions via cgo, don't rely on errno as an indicator of success - check the return value first.
* Set errno to 0 before calling getpwent(3) or getspent(3).
* Make C function calls thread-safe.
* Address a systemd bug where it can return ENOENT even when there is no error.
* Don't fail on every error.
* Fail system test on WARN/ERROR messages in the log.

(cherry picked from commit 8ce5440)
@cwurm cwurm added the v6.7.0 label Mar 11, 2019
cwurm pushed a commit to cwurm/beats that referenced this pull request Mar 11, 2019
…0942)

Fixes a number of issues in the `user` dataset discovered while testing on Fedora 29:

* When calling C functions via cgo, don't rely on errno as an indicator of success - check the return value first.
* Set errno to 0 before calling getpwent(3) or getspent(3).
* Make C function calls thread-safe.
* Address a systemd bug where it can return ENOENT even when there is no error.
* Don't fail on every error.
* Fail system test on WARN/ERROR messages in the log.

(cherry picked from commit 8ce5440)
@cwurm cwurm added the v7.0.0 label Mar 11, 2019
cwurm pushed a commit to cwurm/beats that referenced this pull request Mar 11, 2019
…0942)

Fixes a number of issues in the `user` dataset discovered while testing on Fedora 29:

* When calling C functions via cgo, don't rely on errno as an indicator of success - check the return value first.
* Set errno to 0 before calling getpwent(3) or getspent(3).
* Make C function calls thread-safe.
* Address a systemd bug where it can return ENOENT even when there is no error.
* Don't fail on every error.
* Fail system test on WARN/ERROR messages in the log.

(cherry picked from commit 8ce5440)
@cwurm cwurm added the v7.2.0 label Mar 11, 2019
cwurm pushed a commit that referenced this pull request Mar 12, 2019
…11190)

Fixes a number of issues in the `user` dataset discovered while testing on Fedora 29:

* When calling C functions via cgo, don't rely on errno as an indicator of success - check the return value first.
* Set errno to 0 before calling getpwent(3) or getspent(3).
* Make C function calls thread-safe.
* Address a systemd bug where it can return ENOENT even when there is no error.
* Don't fail on every error.
* Fail system test on WARN/ERROR messages in the log.

(cherry picked from commit 8ce5440)
cwurm pushed a commit that referenced this pull request Mar 12, 2019
…11189)

Fixes a number of issues in the `user` dataset discovered while testing on Fedora 29:

* When calling C functions via cgo, don't rely on errno as an indicator of success - check the return value first.
* Set errno to 0 before calling getpwent(3) or getspent(3).
* Make C function calls thread-safe.
* Address a systemd bug where it can return ENOENT even when there is no error.
* Don't fail on every error.
* Fail system test on WARN/ERROR messages in the log.

(cherry picked from commit 8ce5440)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants