cleaner-hsm: info command needs more info, and better logging #7581

Open
onnozweers opened this issue May 27, 2024 · 5 comments
@onnozweers
Contributor

Dear dCache devs,

I'm currently troubleshooting cleaner-hsm congestion (version 9.2.18). A challenge I'm facing is that I can't see which runs the cleaner has submitted to which pools, and for which of these runs it is still expecting a reply.

We have seen that our HSM script did too many retries for removals, and this caused timeouts between cleaner-hsm and pool. Since the cleaner will do retries anyway, we have now disabled retries for removals in our HSM script. Still, things are not going as smoothly as I hoped. It is currently very difficult to see:

  • which runs have been submitted
  • when
  • to which pools
  • and what the status of each run is.

It would be very helpful if such information were shown by the info command, for example along the lines of the mock-up below.
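
Something like this (purely a mock-up of output I would find useful, not anything the current info command prints):

submitted runs:
  otter12_generictape     submitted 2024-05-27 11:31:44    1 locations  reply received (1 success, 0 failures)
  hedgehog8_generictape   submitted 2024-05-27 11:31:46  100 locations  waiting for reply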

As for logging: with logging set to debug, new jobs are shown as "New run...", without any useful information. Only when a job finishes is there some relevant info:

24 May 2024 14:53:21 (cleaner-hsm) [marten10_lofopsstage PoolRemoveFilesFromHSM] Received a remove reply from pool marten10_lofopsstage, which is no longer waited for. The cleaner pool timeout might be too small.
24 May 2024 14:53:21 (cleaner-hsm) [marten10_lofopsstage PoolRemoveFilesFromHSM] Pool delete responses for HSM osm: 834 success, 1166 failures

It would be helpful if the "New run" log entry included more details, especially to which node the run is submitted (so that I may monitor it there).

Additionally, there is this setting in the cleaner-hsm:

cleaner-hsm.limits.threads = 5

It's currently not clear (to me at least) what effect this setting has. Is it the maximum number of concurrent runs? It might help if the info command showed what these threads were doing.
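
If it simply backs a fixed-size worker pool, I would expect it to bound how many pools can be handled concurrently, roughly like this (just my guess at the semantics, not the actual dCache code):

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ThreadsSettingGuess {
    public static void main(String[] args) {
        int threads = 5; // my reading of cleaner-hsm.limits.threads

        // A fixed pool of 5 workers: at most 5 pools are processed at once;
        // any further pools wait in the executor's queue.
        ExecutorService workers = Executors.newFixedThreadPool(threads);
        for (String pool : List.of("otter12_generictape", "hedgehog8_generictape",
                "seacow8_projectminestage", "penguin9_projectminestage")) {
            workers.submit(() -> System.out.println("sending delete locations to " + pool));
        }
        workers.shutdown();
    }
}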

If someone were to pick this up, it might be efficient to apply similar changes to the cleaner-disk.

Thanks!

@onnozweers
Contributor Author

Apparently our cleaner-hsm was a bit confused. I restarted it, and now at least it seems to log to which node it is sending runs.

27 May 2024 11:31:44 (cleaner-hsm) [] New run...
27 May 2024 11:31:44 (cleaner-hsm) [] sending 1 delete locations for HSM osm to pool otter12_generictape
27 May 2024 11:31:46 (cleaner-hsm) [otter12_generictape PoolRemoveFilesFromHSM] Pool delete responses for HSM osm: 1 success, 0 failures
27 May 2024 11:31:46 (cleaner-hsm) [otter12_generictape PoolRemoveFilesFromHSM] sending 100 delete locations for HSM osm to pool hedgehog8_generictape
27 May 2024 11:32:27 (cleaner-hsm) [hedgehog8_generictape PoolRemoveFilesFromHSM] Pool delete responses for HSM osm: 100 success, 0 failures
27 May 2024 11:32:27 (cleaner-hsm) [hedgehog8_generictape PoolRemoveFilesFromHSM] sending 100 delete locations for HSM osm to pool seacow8_projectminestage

I'm not sure, though, why the responses seem to be logged before the sending entries.

About the state of the cleaner: could the following lines (from before the restart) shed any light on it having a huge backlog?

27 May 2024 11:27:00 (cleaner-hsm) [] The number of cached HSM locations is already the maximum permissible size. Not adding further entries.
27 May 2024 11:27:30 (cleaner-hsm) [] HikariPool-2 - Before cleanup stats (total=1, active=0, idle=1, waiting=0)
27 May 2024 11:27:30 (cleaner-hsm) [] HikariPool-2 - After cleanup  stats (total=1, active=0, idle=1, waiting=0)
27 May 2024 11:27:30 (cleaner-hsm) [] HikariPool-2 - Fill pool skipped, pool has sufficient level or currently being filled (queueDepth=0).

@lemora
Member

lemora commented May 27, 2024

Hi Onno.
Could you please send your entire cleaner-hsm configuration?
I'll look into what can be done in terms of improved info/logging.

Cleaner-disk uses the configured threads to clean that many pools in parallel and reports in info which pools are currently targeted, but this does not make sense for cleaner-hsm.

@lemora lemora self-assigned this May 27, 2024
@onnozweers
Contributor Author

Hi Lea,

Here is our current config. I included the cleaner-disk settings.

[cleanerDomain]
[cleanerDomain/cleaner-disk]
# Number of threads that should work on deletions simultaneously.
# One thread does HSM deletions, the others disk deletions.
# Default: 5
cleaner-disk.limits.threads = 20
# Interval at which the cleaner runs. Default: 120
cleaner-disk.limits.period = 30

[cleanerDomain/cleaner-hsm]
# Each run, the HSM cleaner will send this many deletions
# to a randomly chosen tape pool.
# Default: 500
#cleaner-hsm.limits.batch-size = 2000
# HSM deletions will be regarded as failures if they don't finish in time.
# Default: 120. If the tape system is busy, that's not enough.
cleaner-hsm.service.pool.timeout = 300

We've done considerable tuning over the last few days to try and get rid of the cleaner-hsm backlog. After the restart, the cleaner-hsm reverted to these settings.

I'm not sure what you mean by: "but this does not make sense for cleaner-hsm".

@onnozweers
Contributor Author

I just noticed this log entry:

27 May 2024 11:47:24 (cleaner-hsm) [hedgehog11_tropomirwtape PoolRemoveFilesFromHSM] sending 100 delete locations for HSM osm to pool koala9_tropomirwtape

Why are there two different pool names in this entry? Is one pool sending or forwarding work to another?

@onnozweers
Contributor Author

onnozweers commented May 27, 2024

Another question. By monitoring the number of 'delete' operations in the pool's info output, I can see that even when the removals have all completed in time, the cleaner still considers them a failure after a timeout. There is a gap of a few minutes between the moment the deletes on the pool reach 0/0 (13:28:33) and the moment the cleaner-hsm logs a timeout for the same pool.

27 May 2024 13:27:32 (cleaner-hsm) [] sending 500 delete locations for HSM osm to pool penguin9_projectminestage
27 May 2024 13:27:45 (cleaner-hsm) [] New run...
27 May 2024 13:29:45 (cleaner-hsm) [] New run...
27 May 2024 13:31:45 (cleaner-hsm) [] New run...
27 May 2024 13:32:32 (cleaner-hsm) [] Timeout deleting files on HSM osm attached to penguin9_projectminestage

As you can see from the plateaus in this graph, this seems to happen a lot, and it slows down the processing.

[graph: number of pool delete operations over time, showing recurring plateaus]

Why does the cleaner get a timeout even when all deletes have completed? Or should I make a separate ticket for this?
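
From the outside it looks as if the failure is driven purely by the cleaner's own timer rather than by the pool's progress: if the reply message is lost or not matched to the outstanding run, the timer fires no matter how quickly the deletes finished. A minimal illustration of that pattern (my own sketch, not the actual dCache code):

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimeoutSketch {
    public static void main(String[] args) throws Exception {
        // Suppose the pool finishes its deletes quickly, but the reply never
        // reaches the cleaner (lost, or not matched to the outstanding run).
        CompletableFuture<String> poolReply = new CompletableFuture<>();

        // The cleaner only sees the reply future: it waits for its configured
        // timeout and then books the run as failed, regardless of what the
        // pool actually did.
        try {
            System.out.println("Reply: " + poolReply.get(2, TimeUnit.SECONDS));
        } catch (TimeoutException e) {
            System.out.println("Timeout deleting files on HSM osm");
        }
    }
}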

lemora added a commit that referenced this issue Jul 30, 2024
Motivation:
Modification:

To cleaner-hsm's admin `info` command, add information about which pools are currently waited for. These pools are instructed to delete files from an attached HSM and should reply with success or failure once done.
Additionally, change pool selection behaviour to only select one pool for hsm-cleaning at a time.

Result:

More information for admins: waited-for pools are now shown in cleaner-hsm's `info` command.

Target: master
Requires-notes: no
Requires-book: no
Addresses: #7581
Acked-by: Tigran Mkrtchyan
Patch: https://rb.dcache.org/r/14270/