
Cluster bricks when all journalnodes are down #338

Closed
nightkr opened this issue Mar 30, 2023 · 3 comments


nightkr commented Mar 30, 2023

Affected version

main

Current and expected behavior

The JournalNodes seem to get stuck in a crashloop if all of them are deleted at the same time, complaining that they cannot find the missing edits.
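
For reference, a minimal reproduction sketch (assuming an HA cluster named `hdfs` with three JournalNode replicas in the `default` role group, matching the pod names in the logs further down) is to delete all JournalNode pods at once:

```sh
# Hypothetical reproduction sketch; pod names assume a cluster named "hdfs"
# with the "default" role group and three JournalNode replicas.
kubectl delete pod \
  hdfs-journalnode-default-0 \
  hdfs-journalnode-default-1 \
  hdfs-journalnode-default-2
# The recreated JournalNode pods then crashloop with errors about missing edits.
```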

Possible solution

No response

Additional context

No response

Environment

No response

Would you like to work on fixing this bug?

None

@sbernauer

Good news:
IIRC this happened when demoing Kerberos support; I ran into this problem as well.
The current WIP state of the HDFS Kerberos work has solved the problem and added kuttl tests covering this scenario. Linking the PR as the fix.

@sbernauer

The problem was the JournalNodes rejecting the NameNodes because of reverse DNS roulette:

org.apache.hadoop.hdfs.server.common.HttpGetFailedException: Fetch of https://hdfs-journalnode-default-0.hdfs-journalnode-default.kuttl-test-fine-rat.svc.cluster.local:8481/getJournal?jid=hdfs&segmentTxId=1&storageInfo=-65%3A595659877%3A1685437352616%3ACID-90c52400-5b07-49bf-bdbe-3469bbdc5ebb&inProgressOk=true failed with status code 403
Response message:
Only Namenode and another JournalNode may access this servlet

resulting in

2023-05-30 09:23:01,651 ERROR namenode.RedundantEditLogInputStream (RedundantEditLogInputStream.java:nextOp(233)) - Got error reading edit log input stream https://hdfs-journalnode-default-0.hdfs-journalnode-default.kuttl-test-fine-rat.svc.cluster.local:8481/getJournal?jid=hdfs&segmentTxId=1&storageInfo=-65%3A595659877%3A1685437352616%3ACID-90c52400-5b07-49bf-bdbe-3469bbdc5ebb&inProgressOk=true; failing over to edit log https://hdfs-journalnode-default-1.hdfs-journalnode-default.kuttl-test-fine-rat.svc.cluster.local:8481/getJournal?jid=hdfs&segmentTxId=1&storageInfo=-65%3A595659877%3A1685437352616%3ACID-90c52400-5b07-49bf-bdbe-3469bbdc5ebb&inProgressOk=true

The `got premature end-of-file at txid 0; expected file to go up to 10` error was caused by the failover.

I solved the problem by not using `_HOST` in the principals, so that the JournalNode no longer rejected the NameNode.
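
For illustration, a minimal `hdfs-site.xml` sketch of that shape of fix (the property names are standard Hadoop keys, but the values are assumptions modeled on the service names in the logs above, not necessarily what the operator generates):

```xml
<!-- Hypothetical sketch: bind the principals to a stable service FQDN instead of
     relying on Hadoop's _HOST substitution, which is resolved via DNS at runtime. -->
<property>
  <name>dfs.namenode.kerberos.principal</name>
  <!-- instead of nn/_HOST@CLUSTER.LOCAL -->
  <value>nn/hdfs-namenode-default.kuttl-test-fine-rat.svc.cluster.local@CLUSTER.LOCAL</value>
</property>
<property>
  <name>dfs.journalnode.kerberos.principal</name>
  <!-- instead of jn/_HOST@CLUSTER.LOCAL -->
  <value>jn/hdfs-journalnode-default.kuttl-test-fine-rat.svc.cluster.local@CLUSTER.LOCAL</value>
</property>
```

With fixed principals, the JournalNode's "valid requestor" check no longer depends on what DNS happens to return for the NameNode at that moment.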

@sbernauer

Closing this, as we fixed the problem in the Kerberos feature branch. Please feel free to re-open if the problem reappears!

bors bot pushed a commit that referenced this issue Jun 14, 2023
# Description

Closes #178
Fixes #338

TODOs

- [x] Release a new Hadoop image with openssl and the Kerberos clients and use it in docs and tests
- [x] Release and use operator-rs change
- [x] Fix the hardcoded `kinit nn/simple-hdfs-namenode-default.default.svc.cluster.local@CLUSTER.LOCAL -kt /stackable/kerberos/keytab` in the entrypoints (see the sketch after this list)
- [x] Go through all hadoop settings and see if they can be improved
- [x] Test different realms
- [x] Discuss CRD change
- [x] Discuss how to expose this in the discovery ConfigMap -> during the 2023/05 on-site we decided to ship this feature without exposing it via discovery *for now*
- [x] Implement discovery
- [x] Tests
- [x] Docs
- [x] Let @maltesander have a look at how we can better include the init container in the code structure
- [x] Test long running cluster (maybe turn down ticket lifetime for that)
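
Regarding the hardcoded `kinit` item above, a hypothetical entrypoint sketch (the variable names and the way they would be injected are assumptions, not the operator's actual entrypoint) that composes the principal from the pod's environment instead of hardcoding the cluster name and namespace:

```sh
#!/usr/bin/env bash
# Hypothetical sketch only: build the principal from environment variables
# (e.g. injected via the downward API) instead of hardcoding it.
set -euo pipefail

NAMESPACE="${NAMESPACE:?namespace must be set}"          # assumed env var
SERVICE_NAME="${HDFS_SERVICE_NAME:?service must be set}" # e.g. simple-hdfs-namenode-default
REALM="${KERBEROS_REALM:-CLUSTER.LOCAL}"                 # assumed default realm

kinit "nn/${SERVICE_NAME}.${NAMESPACE}.svc.cluster.local@${REALM}" \
  -kt /stackable/kerberos/keytab
```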