[Auditbeat] User dataset: Numerous fixes to error handling #10942

cwurm · 2019-02-26T11:42:31Z

When testing the user dataset on Fedora 29, I noticed it's not working. Digging into it, there were a few issues, all of which this PR fixes:

When calling C functions via cgo, we should not rely on the error return value to determine if it failed. In most cases, there is no guarantee that errno is going to be zero on success (see here for some detail). Instead, we should check the return value first, and if that indicates that there might have been an error (usually when it's NULL/nil) we should check the error return.
getpwent(3) explicitly states If one wants to check errno after the call, it should be set to zero before the call. errno cannot be accessed directly in Go code, so we introduce a helper function setErrno for that. getspent(3) does not have the same warning, but given that at least glibc uses the same code path for both, it's reasonable to assume the same applies. So we set errno to 0 before calling both functions.
errno is thread-local, and the function families setpwent/endpwent/getpwent and setspent/endspent/getspent are not thread-safe, so we introduce runtime.LockOSThread()/UnlockOSThread() to make sure all C functions are run on the same OS thread.
Usually, when iterating through users getpwent() should return NULL and errno should be zero when all entries have been returned. However, there is a bug in systemd that causes it to set errno to ENOENT even when there is no error (nss-systemd's getpwent sets errno to ENOENT systemd/systemd#9585). This bug affects at least Fedora 29. Following this change ENOENT is treated as if there is no error. It's not supposed to be a valid error value for getpwent() anyway, so should not happen anytime outside this bug.
The previous systemd bug caused the user dataset to return no users, even though all users had been read successfully. This PR changes to gathering errors in multierrors and returning them alongside whatever data could be read. This is in line with wanting to make the System module more resilient to non-fatal failures during data collection (better to return some things alongside an error than no things at all).
The system test will now fail when there are WARN/ERROR messages in the log. This change can probably also be made for the other datasets.

Writing up this PR description I'm realizing this combines quite a few issues. I could split it into multiple PRs if anybody wants, though most would have only a few lines of changed code. Because some indentation changed it's best to check Github's No Whitespace button.

elasticmachine · 2019-02-26T11:42:33Z

Pinging @elastic/secops

andrewkroh · 2019-02-28T18:06:25Z

x-pack/auditbeat/module/system/user/user.go

-	}
+		for _, user := range users {
+			event := ms.userEvent(user, eventTypeState, eventActionExistingUser)
+			event.RootFields.Put("event.id", stateID.String())


stateID will be an all zero uuid if uuid.NewV4() failed. What's the impact of this on the metricset and the consumers of this data?

When sending a list of all users, the event.id is what they have in common, to make it easy for consumers to correlate them. If generating it failed, it would no longer be possible to use it for that purpose - lists of users sent from the same system at different times would have the same (zero) event.id. Instead, they would have to fall back on the less user-friendly timestamp.

It's unlikely to happen, but if it did the dataset would still send all the users, and an error as well.

x-pack/auditbeat/module/system/user/users_linux.go

andrewkroh

Is there anything that needs to be put into the changelog? Or is this unreleased.

cwurm · 2019-03-06T10:30:12Z

Is there anything that needs to be put into the changelog? Or is this unreleased.

Good point, I've added an entry.

…0942) Fixes a number of issues in the `user` dataset discovered while testing on Fedora 29: * When calling C functions via cgo, don't rely on errno as an indicator of success - check the return value first. * Set errno to 0 before calling getpwent(3) or getspent(3). * Make C function calls thread-safe. * Address a systemd bug where it can return ENOENT even when there is no error. * Don't fail on every error. * Fail system test on WARN/ERROR messages in the log. (cherry picked from commit 8ce5440)

…11190) Fixes a number of issues in the `user` dataset discovered while testing on Fedora 29: * When calling C functions via cgo, don't rely on errno as an indicator of success - check the return value first. * Set errno to 0 before calling getpwent(3) or getspent(3). * Make C function calls thread-safe. * Address a systemd bug where it can return ENOENT even when there is no error. * Don't fail on every error. * Fail system test on WARN/ERROR messages in the log. (cherry picked from commit 8ce5440)

…11189) Fixes a number of issues in the `user` dataset discovered while testing on Fedora 29: * When calling C functions via cgo, don't rely on errno as an indicator of success - check the return value first. * Set errno to 0 before calling getpwent(3) or getspent(3). * Make C function calls thread-safe. * Address a systemd bug where it can return ENOENT even when there is no error. * Don't fail on every error. * Fail system test on WARN/ERROR messages in the log. (cherry picked from commit 8ce5440)

cwurm added bug review needs_backport PR is waiting to be backported to other branches. Auditbeat SecOps labels Feb 26, 2019

cwurm requested a review from a team as a code owner February 26, 2019 11:42

andrewkroh reviewed Feb 28, 2019

View reviewed changes

andrewkroh approved these changes Mar 6, 2019

View reviewed changes

Christoph Wurm added 7 commits March 6, 2019 13:36

Correctly handle and pass through errors.

0aa7500

Work around systemd bug.

c1363c8

Fail test on warnings and errors.

c8dd26f

Lock OS thread for C functions.

12a952b

Change to clearErrno().

b3b3d2d

Disable password change detection when testing.

f171709

Changelog

889d942

cwurm force-pushed the user_fedora29 branch from 15c7604 to 889d942 Compare March 6, 2019 13:36

andrewkroh approved these changes Mar 6, 2019

View reviewed changes

cwurm merged commit 8ce5440 into elastic:master Mar 11, 2019

cwurm deleted the user_fedora29 branch March 11, 2019 16:15

cwurm mentioned this pull request Mar 11, 2019

[Auditbeat] Cherry-pick #10942 to 6.7: User dataset: Numerous fixes to error handling #11189

Merged

cwurm removed the needs_backport PR is waiting to be backported to other branches. label Mar 11, 2019

cwurm added the v6.7.0 label Mar 11, 2019

cwurm mentioned this pull request Mar 11, 2019

[Auditbeat] Cherry-pick #10942 to 7.0: User dataset: Numerous fixes to error handling #11190

Merged

cwurm added the v7.0.0 label Mar 11, 2019

cwurm mentioned this pull request Mar 11, 2019

[Auditbeat] Cherry-pick #10942 to 7.x: User dataset: Numerous fixes to error handling #11191

Closed

cwurm added the v7.2.0 label Mar 11, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Auditbeat] User dataset: Numerous fixes to error handling #10942

[Auditbeat] User dataset: Numerous fixes to error handling #10942

cwurm commented Feb 26, 2019

elasticmachine commented Feb 26, 2019

andrewkroh Feb 28, 2019

cwurm Mar 1, 2019

andrewkroh left a comment

cwurm commented Mar 6, 2019

[Auditbeat] User dataset: Numerous fixes to error handling #10942

[Auditbeat] User dataset: Numerous fixes to error handling #10942

Conversation

cwurm commented Feb 26, 2019

elasticmachine commented Feb 26, 2019

andrewkroh Feb 28, 2019

Choose a reason for hiding this comment

cwurm Mar 1, 2019

Choose a reason for hiding this comment

andrewkroh left a comment

Choose a reason for hiding this comment

cwurm commented Mar 6, 2019