
Cluster bricks when all journalnodes are down #338

Closed
nightkr opened this issue Mar 30, 2023 · 3 comments


nightkr commented Mar 30, 2023

Affected version

main

Current and expected behavior

The JournalNodes seem to get stuck in a crashloop if all of them are deleted at the same time, complaining that they cannot find the missing edits.
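
For reference, a minimal reproduction sketch (assuming an HA cluster named `hdfs` with three JournalNode replicas in the `default` role group, matching the pod names in the logs further down) is to delete all JournalNode pods at once:

```sh
# Hypothetical reproduction sketch; pod names assume a cluster named "hdfs"
# with the "default" role group and three JournalNode replicas.
kubectl delete pod \
  hdfs-journalnode-default-0 \
  hdfs-journalnode-default-1 \
  hdfs-journalnode-default-2
# The recreated JournalNode pods then crashloop with errors about missing edits.
```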

Possible solution

No response

Additional context

No response

Environment

No response

Would you like to work on fixing this bug?

None

@sbernauer

Good news:
IIRC this happened when demoing Kerberos support; I ran into this problem as well.
The current WIP state of the HDFS Kerberos work has solved the problem and added kuttl tests covering this scenario. Linking the PR as the fix.

@sbernauer

The problem was the JournalNodes rejecting the NameNodes because of reverse DNS roulette:

org.apache.hadoop.hdfs.server.common.HttpGetFailedException: Fetch of https://hdfs-journalnode-default-0.hdfs-journalnode-default.kuttl-test-fine-rat.svc.cluster.local:8481/getJournal?jid=hdfs&segmentTxId=1&storageInfo=-65%3A595659877%3A1685437352616%3ACID-90c52400-5b07-49bf-bdbe-3469bbdc5ebb&inProgressOk=true failed with status code 403
Response message:
Only Namenode and another JournalNode may access this servlet

resulting in

2023-05-30 09:23:01,651 ERROR namenode.RedundantEditLogInputStream (RedundantEditLogInputStream.java:nextOp(233)) - Got error reading edit log input stream https://hdfs-journalnode-default-0.hdfs-journalnode-default.kuttl-test-fine-rat.svc.cluster.local:8481/getJournal?jid=hdfs&segmentTxId=1&storageInfo=-65%3A595659877%3A1685437352616%3ACID-90c52400-5b07-49bf-bdbe-3469bbdc5ebb&inProgressOk=true; failing over to edit log https://hdfs-journalnode-default-1.hdfs-journalnode-default.kuttl-test-fine-rat.svc.cluster.local:8481/getJournal?jid=hdfs&segmentTxId=1&storageInfo=-65%3A595659877%3A1685437352616%3ACID-90c52400-5b07-49bf-bdbe-3469bbdc5ebb&inProgressOk=true

The `got premature end-of-file at txid 0; expected file to go up to 10` error was caused by the failover.

I solved the problem by not using `_HOST` in the principals, so that the JournalNode no longer rejected the NameNode.
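
For illustration, a minimal `hdfs-site.xml` sketch of that shape of fix (the property names are standard Hadoop keys, but the values are assumptions modeled on the service names in the logs above, not necessarily what the operator generates):

```xml
<!-- Hypothetical sketch: bind the principals to a stable service FQDN instead of
     relying on Hadoop's _HOST substitution, which is resolved via DNS at runtime. -->
<property>
  <name>dfs.namenode.kerberos.principal</name>
  <!-- instead of nn/_HOST@CLUSTER.LOCAL -->
  <value>nn/hdfs-namenode-default.kuttl-test-fine-rat.svc.cluster.local@CLUSTER.LOCAL</value>
</property>
<property>
  <name>dfs.journalnode.kerberos.principal</name>
  <!-- instead of jn/_HOST@CLUSTER.LOCAL -->
  <value>jn/hdfs-journalnode-default.kuttl-test-fine-rat.svc.cluster.local@CLUSTER.LOCAL</value>
</property>
```

With fixed principals, the JournalNode's "valid requestor" check no longer depends on what DNS happens to return for the NameNode at that moment.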

@sbernauer

Closing this, as we fixed the problem in the Kerberos feature branch. Please feel free to re-open if the problem reappears!

bors bot pushed a commit that referenced this issue Jun 14, 2023
# Description

Closes #178
Fixes #338

TODOs

- [x] Release a new Hadoop image with openssl and the Kerberos clients and use it in docs and tests
- [x] Release and use operator-rs change
- [x] Fix the hardcoded `kinit nn/simple-hdfs-namenode-default.default.svc.cluster.local@CLUSTER.LOCAL -kt /stackable/kerberos/keytab` in the entrypoints (see the sketch after this list)
- [x] Go through all hadoop settings and see if they can be improved
- [x] Test different realms
- [x] Discuss CRD change
- [x] Discuss how to expose this in the discovery ConfigMap -> during the 2023/05 on-site we decided to ship this feature without exposing it via discovery *for now*
- [x] Implement discovery
- [x] Tests
- [x] Docs
- [x] Let @maltesander have a look at how we can better include the init container in the code structure
- [x] Test long running cluster (maybe turn down ticket lifetime for that)
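
Regarding the hardcoded `kinit` item above, a hypothetical entrypoint sketch (the variable names and the way they would be injected are assumptions, not the operator's actual entrypoint) that composes the principal from the pod's environment instead of hardcoding the cluster name and namespace:

```sh
#!/usr/bin/env bash
# Hypothetical sketch only: build the principal from environment variables
# (e.g. injected via the downward API) instead of hardcoding it.
set -euo pipefail

NAMESPACE="${NAMESPACE:?namespace must be set}"          # assumed env var
SERVICE_NAME="${HDFS_SERVICE_NAME:?service must be set}" # e.g. simple-hdfs-namenode-default
REALM="${KERBEROS_REALM:-CLUSTER.LOCAL}"                 # assumed default realm

kinit "nn/${SERVICE_NAME}.${NAMESPACE}.svc.cluster.local@${REALM}" \
  -kt /stackable/kerberos/keytab
```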