cleaner-hsm: info command needs more info, and better logging #7581

Open
onnozweers opened this issue May 27, 2024 · 5 comments
@onnozweers
Contributor

Dear dCache devs,

I'm currently troubleshooting cleaner-hsm congestion (version 9.2.18). A challenge I'm facing is that I can't see which runs the cleaner has submitted to which pools, and for which of these runs it is still expecting a reply.

We have seen that our HSM script did too many retries for removals, and this caused timeouts between cleaner-hsm and pool. Since the cleaner will do retries anyway, we have now disabled retries for removals in our HSM script. Still, things are not going as smoothly as I hoped. It is currently very difficult to see:

  • which runs have been submitted
  • when
  • to which pools
  • and what the status of each run is.

It would be very helpful if such information were shown by the info command, for example along the lines of the mock-up below.
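
Something like this (purely a mock-up of output I would find useful, not anything the current info command prints):

submitted runs:
  otter12_generictape     submitted 2024-05-27 11:31:44    1 locations  reply received (1 success, 0 failures)
  hedgehog8_generictape   submitted 2024-05-27 11:31:46  100 locations  waiting for reply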

As for logging: with logging set to debug, new jobs are shown as "New run...", without any useful information. Only when a job finishes is there some relevant info:

24 May 2024 14:53:21 (cleaner-hsm) [marten10_lofopsstage PoolRemoveFilesFromHSM] Received a remove reply from pool marten10_lofopsstage, which is no longer waited for. The cleaner pool timeout might be too small.
24 May 2024 14:53:21 (cleaner-hsm) [marten10_lofopsstage PoolRemoveFilesFromHSM] Pool delete responses for HSM osm: 834 success, 1166 failures

It would be helpful if the "New run" log entry included more details, especially to which node the run is submitted (so that I may monitor it there).

Additionally, there is this setting in the cleaner-hsm:

cleaner-hsm.limits.threads = 5

It's currently not clear (to me at least) what effect this setting has. Is it the maximum number of concurrent runs? It might help if the info command showed what these threads were doing.
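
If it simply backs a fixed-size worker pool, I would expect it to bound how many pools can be handled concurrently, roughly like this (just my guess at the semantics, not the actual dCache code):

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ThreadsSettingGuess {
    public static void main(String[] args) {
        int threads = 5; // my reading of cleaner-hsm.limits.threads

        // A fixed pool of 5 workers: at most 5 pools are processed at once;
        // any further pools wait in the executor's queue.
        ExecutorService workers = Executors.newFixedThreadPool(threads);
        for (String pool : List.of("otter12_generictape", "hedgehog8_generictape",
                "seacow8_projectminestage", "penguin9_projectminestage")) {
            workers.submit(() -> System.out.println("sending delete locations to " + pool));
        }
        workers.shutdown();
    }
}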

If someone were to pick this up, it might be efficient to apply similar changes to the cleaner-disk.

Thanks!

@onnozweers
Contributor Author

Apparently our cleaner-hsm was a bit confused. I restarted it, and now at least it seems to log to which node it is sending runs.

27 May 2024 11:31:44 (cleaner-hsm) [] New run...
27 May 2024 11:31:44 (cleaner-hsm) [] sending 1 delete locations for HSM osm to pool otter12_generictape
27 May 2024 11:31:46 (cleaner-hsm) [otter12_generictape PoolRemoveFilesFromHSM] Pool delete responses for HSM osm: 1 success, 0 failures
27 May 2024 11:31:46 (cleaner-hsm) [otter12_generictape PoolRemoveFilesFromHSM] sending 100 delete locations for HSM osm to pool hedgehog8_generictape
27 May 2024 11:32:27 (cleaner-hsm) [hedgehog8_generictape PoolRemoveFilesFromHSM] Pool delete responses for HSM osm: 100 success, 0 failures
27 May 2024 11:32:27 (cleaner-hsm) [hedgehog8_generictape PoolRemoveFilesFromHSM] sending 100 delete locations for HSM osm to pool seacow8_projectminestage

I'm not sure, though, why the responses seem to be logged before the sending entries.

About the state of the cleaner: could the following lines (from before the restart) shed any light on it having a huge backlog?

27 May 2024 11:27:00 (cleaner-hsm) [] The number of cached HSM locations is already the maximum permissible size. Not adding further entries.
27 May 2024 11:27:30 (cleaner-hsm) [] HikariPool-2 - Before cleanup stats (total=1, active=0, idle=1, waiting=0)
27 May 2024 11:27:30 (cleaner-hsm) [] HikariPool-2 - After cleanup  stats (total=1, active=0, idle=1, waiting=0)
27 May 2024 11:27:30 (cleaner-hsm) [] HikariPool-2 - Fill pool skipped, pool has sufficient level or currently being filled (queueDepth=0).

@lemora
Member

lemora commented May 27, 2024

Hi Onno.
Could you please send your entire cleaner-hsm configuration?
I'll look into what can be done in terms of improved info/logging.

Cleaner-disk uses the configured threads to clean that many pools in parallel and reports in info which pools are currently targeted, but this does not make sense for cleaner-hsm.

@lemora lemora self-assigned this May 27, 2024
@onnozweers
Contributor Author

Hi Lea,

Here is our current config. I included the cleaner-disk settings.

[cleanerDomain]
[cleanerDomain/cleaner-disk]
# Number of threads that should work on deletions simultaneously.
# One thread does HSM deletions, the others disk deletions.
# Default: 5
cleaner-disk.limits.threads = 20
# Interval at which the cleaner runs. Default: 120
cleaner-disk.limits.period = 30

[cleanerDomain/cleaner-hsm]
# Each run, the HSM cleaner will send this many deletions
# to a randomly chosen tape pool.
# Default: 500
#cleaner-hsm.limits.batch-size = 2000
# HSM deletions will be regarded as failures if they don't finish in time.
# Default: 120. If the tape system is busy, that's not enough.
cleaner-hsm.service.pool.timeout = 300

We've done considerable tuning over the last few days to try and get rid of the cleaner-hsm backlog. After the restart, the cleaner-hsm reverted to these settings.

I'm not sure what you mean by: "but this does not make sense for cleaner-hsm".

@onnozweers
Contributor Author

I just noticed this log entry:

27 May 2024 11:47:24 (cleaner-hsm) [hedgehog11_tropomirwtape PoolRemoveFilesFromHSM] sending 100 delete locations for HSM osm to pool koala9_tropomirwtape

Why are there two different pool names in this entry? Is one pool sending or forwarding work to another?

@onnozweers
Contributor Author

onnozweers commented May 27, 2024

Another question. By monitoring the number of 'delete' operations in the pool's info output, I can see that even when the removals have all completed in time, the cleaner still considers them a failure after a timeout. There is a gap of a few minutes between the moment the deletes on the pool reach 0/0 (13:28:33) and the moment the cleaner-hsm logs a timeout for the same pool.

27 May 2024 13:27:32 (cleaner-hsm) [] sending 500 delete locations for HSM osm to pool penguin9_projectminestage
27 May 2024 13:27:45 (cleaner-hsm) [] New run...
27 May 2024 13:29:45 (cleaner-hsm) [] New run...
27 May 2024 13:31:45 (cleaner-hsm) [] New run...
27 May 2024 13:32:32 (cleaner-hsm) [] Timeout deleting files on HSM osm attached to penguin9_projectminestage

As you can see from the plateaus in this graph, this seems to happen a lot, and it slows down the processing.

[graph: number of pool delete operations over time, showing recurring plateaus]

Why does the cleaner get a timeout even when all deletes have completed? Or should I make a separate ticket for this?
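
From the outside it looks as if the failure is driven purely by the cleaner's own timer rather than by the pool's progress: if the reply message is lost or not matched to the outstanding run, the timer fires no matter how quickly the deletes finished. A minimal illustration of that pattern (my own sketch, not the actual dCache code):

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimeoutSketch {
    public static void main(String[] args) throws Exception {
        // Suppose the pool finishes its deletes quickly, but the reply never
        // reaches the cleaner (lost, or not matched to the outstanding run).
        CompletableFuture<String> poolReply = new CompletableFuture<>();

        // The cleaner only sees the reply future: it waits for its configured
        // timeout and then books the run as failed, regardless of what the
        // pool actually did.
        try {
            System.out.println("Reply: " + poolReply.get(2, TimeUnit.SECONDS));
        } catch (TimeoutException e) {
            System.out.println("Timeout deleting files on HSM osm");
        }
    }
}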

lemora added a commit that referenced this issue Jul 30, 2024
Motivation:
Modification:

To cleaner-hsm's admin `info` command, add information about which pools are currently waited for. These pools are instructed to delete files from an attached HSM and should reply with success or failure once done.
Additionally, change pool selection behaviour to only select one pool for hsm-cleaning at a time.

Result:

More information for admins: waited-for pools are now shown in cleaner-hsm's `info` command.

Target: master
Requires-notes: no
Requires-book: no
Addresses: #7581
Acked-by: Tigran Mkrtchyan
Patch: https://rb.dcache.org/r/14270/