Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unable to reconnect to ZooKeeper service, session expired #7564

Closed
QiulanHuang opened this issue May 3, 2024 · 8 comments
Closed

Unable to reconnect to ZooKeeper service, session expired #7564

QiulanHuang opened this issue May 3, 2024 · 8 comments

Comments

@QiulanHuang
Copy link

Dear all,

Recently, we noticed some pools failed to reconnect to ZooKeeper Service complaining the session expired. It needs to restart the pool to fix it. The error log is listed below.

03 May 2024 10:11:46 (System) [] Session 0x10041c40b758d24 for server dczoo01.usatlas.bnl.gov/10.42.34.241:2181, Closing socket connection. Attempting reconnect except it is a SessionExpiredException.
org.apache.zookeeper.ClientCnxn$SessionExpiredException: Unable to reconnect to ZooKeeper service, session 0x10041c40b758d24 has expired
        at org.apache.zookeeper.ClientCnxn$SendThread.onConnected(ClientCnxn.java:1432)
        at org.apache.zookeeper.ClientCnxnSocket.readConnectResult(ClientCnxnSocket.java:154)
        at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:86)
        at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:350)
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1283)
03 May 2024 10:11:46 (System) [] Invalid config event received: {server.2=dczoo03.usatlas.bnl.gov:2888:3888:participant, server.1=dczoo02.usatlas.bnl.gov:2888:3888:participant, server.0=dczoo01.usatlas.bnl.gov:2888:3888:participant, version=0}

We are using dCache 9.2.17. The issue also happened in the old version 9.2.6, btw.
It's not clear the root cause. I didn't noticed the network problem between the pools and the Zookeeper servers.

Regards,
Qiulan

@lemora
Copy link
Member

lemora commented May 7, 2024

Hi Qiulan.

Which ZooKeeper version are you using? You have a cluster of 3 standalone ones, right?
The line Invalid config event received indicates that you ZooKeeper configuration is wrong, please check it.
Since 6.2, dCache requires the associated ZooKeeper servers to be of version 3.5 or above, but with ZooKeeper 3.5.0 the configuration format has changed. Previously, the zoo.conf files could contain a separate line stating the client port, clientPort=2181, which now needs to be at the end of the line: server.1=os-zk-node01:2888:3888;2181. As far as I know, newer versions were backward compatible and this was just a warning, but maybe this has changed.

As to the SessionExpiredException: Unable to reconnect to ZooKeeper service: is your zookeeper cluster stable, is there anything logged on the zk side?

Lea

@QiulanHuang
Copy link
Author

Hello Lea,

Thank you for your reply.

We are using ZooKeeper 3.6.3. The logs on ZK side shows the session expired too.

2024-05-03 10:11:46,254 [myid:0] - INFO  [QuorumPeer[myid=0](plain=[0:0:0:0:0:0:0:0]:2181)(secure=disabled):ZooKeeperServer@1059] - Invalid session 0x10041c40b758d24 for client /10.42.38.120:43812, probably expired

Thanks,
Qiulan

@cfgamboa
Copy link

cfgamboa commented May 8, 2024

Yes we have a cluster of three nodes and it is stable

@kofemann
Copy link
Member

@QiulanHuang Does all dcache components run 9.2.17?

@QiulanHuang
Copy link
Author

Hello @kofemann We are running in mixed mode now. Most of the nodes are running 9.2.17 and around 10 new added pool nodes are running 9.2.20.

Thanks,
Qiulan

@cfgamboa
Copy link

cfgamboa commented Sep 18, 2024

Hello

The goal is to simulate a host network disturbance and the pool's state.

dc234 is used to simulate this, there is a test pool enabled.

dcache pool ls
POOL    DOMAIN         META SIZE    FREE    PATH                      
dc234_1 dc234oneDomain db   117714G 134108G /data0/dcache_pool_1/pool 

Pool restarted

22 Aug 2024 10:30:07 (System) [] Invalid config event received: {server.2=dczoo03.usatlas.bnl.gov:2888:3888:participant, server.1=dczoo02.usatlas.bnl.gov:2888:3888:participant, server.0=dczoo01.usatlas.bnl.gov:2888:3888:participant, version=0}
22 Aug 2024 10:30:07 (System) [] Invalid config event received: {server.2=dczoo03.usatlas.bnl.gov:2888:3888:participant, server.1=dczoo02.usatlas.bnl.gov:2888:3888:participant, server.0=dczoo01.usatlas.bnl.gov:2888:3888:participant, version=0}
22 Aug 2024 10:30:11 (dc234_1) [] Pool mode changed to disabled(fetch,store,stage,p2p-client,p2p-server): Awaiting initialization
22 Aug 2024 10:30:11 (dc234_1) [] Pool mode changed to disabled(store,stage,p2p-client,loading): Loading...
22 Aug 2024 10:30:11 (dc234_1) [] Reading inventory from Inotify[Checksum[IoStatistics[data=/data0/dcache_pool_1/pool/data;meta=/data0/dcache_pool_1/pool/meta]]].
22 Aug 2024 10:30:11 (dc234_1) [] Pool mode changed to enabled

State of the network bond0 active peers

[root@dc234 ~]# ifstat 
#kernel
Interface        RX Pkts/Rate    TX Pkts/Rate    RX Data/Rate    TX Data/Rate  
                 RX Errs/Drop    TX Errs/Drop    RX Over/Rate    TX Coll/Rate  
lo               546421K 0       546421K 0         1980M 0         1980M 0      
                       0 0             0 0             0 0             0 0      
Internal25G00      2405M 0         4114M 0         3974M 0       882022K 0      
                       0 3319K         0 0             0 0             0 0      
Internal25G01      3832M 0         2440M 0         3848M 0         3711M 0      
                       0 2472K         0 0             0 0             0 0      
bond0              1018M 0         2847M 0         2631M 0         3394M 0      
                       0 10735K        0 0             0 0             0 0    

simulating partial network degration by stopping a peer from bond

[root@dc234 ~]# ip link set dev Internal25G00 down
[root@dc234 ~]# tail -100f /var/log/messages|grep bond
Aug 22 10:35:34 dc234 kernel: bond0: link status definitely down for interface Internal25G00, disabling it


[root@dc234 ~]# ifstat
#kernel
Interface        RX Pkts/Rate    TX Pkts/Rate    RX Data/Rate    TX Data/Rate  
                 RX Errs/Drop    TX Errs/Drop    RX Over/Rate    TX Coll/Rate  
lo                    60 0            60 0          5128 0          5128 0      
                       0 0             0 0             0 0             0 0      
Internal25G01       3563 0          4925 0        278760 0         3034K 0      
                       0 0             0 0             0 0             0 0      
bond0               7136 0          5629 0        534351 0         3209K 0      
                       0 0             0 0             0 0             0 0 

checking state of pool

[root@dc234 ~]# tail -f /var/log/dcache/dc234oneDomain.log
22 Aug 2024 10:30:07 (System) [] Invalid config event received: {server.2=dczoo03.usatlas.bnl.gov:2888:3888:participant, server.1=dczoo02.usatlas.bnl.gov:2888:3888:participant, server.0=dczoo01.usatlas.bnl.gov:2888:3888:participant, version=0}
22 Aug 2024 10:30:11 (dc234_1) [] Pool mode changed to disabled(fetch,store,stage,p2p-client,p2p-server): Awaiting initialization
22 Aug 2024 10:30:11 (dc234_1) [] Pool mode changed to disabled(store,stage,p2p-client,loading): Loading...
22 Aug 2024 10:30:11 (dc234_1) [] Reading inventory from Inotify[Checksum[IoStatistics[data=/data0/dcache_pool_1/pool/data;meta=/data0/dcache_pool_1/pool/meta]]].
22 Aug 2024 10:30:11 (dc234_1) [] Pool mode changed to enabled
22 Aug 2024 10:35:58 (System) [] Client session timed out, have not heard from server in 26680ms for session id 0x10041c40b759013
22 Aug 2024 10:35:58 (System) [] Session 0x10041c40b759013 for server dczoo02.usatlas.bnl.gov/10.42.34.242:2181, Closing socket connection. Attempting reconnect except it is a SessionExpiredException.
org.apache.zookeeper.ClientCnxn$SessionTimeoutException: Client session timed out, have not heard from server in 26680ms for session id 0x10041c40b759013
	at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1251)
22 Aug 2024 10:35:58 (System) [] Invalid config event received: {server.2=dczoo03.usatlas.bnl.gov:2888:3888:participant, server.1=dczoo02.usatlas.bnl.gov:2888:3888:participant, server.0=dczoo01.usatlas.bnl.gov:2888:3888:participant, version=0}

Relevant error message is

22 Aug 2024 10:35:58 (System) [] Client session timed out, have not heard from server in 26680ms for session id 0x10041c40b759013
22 Aug 2024 10:35:58 (System) [] Session 0x10041c40b759013 for server dczoo02.usatlas.bnl.gov/10.42.34.242:2181, Closing socket connection. Attempting reconnect except it is a SessionExpiredException.
org.apache.zookeeper.ClientCnxn$SessionTimeoutException: Client session timed out, have not heard from server in 26680ms for session id 0x10041c40b759013
	at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1251)

2.1. Pool appears operational without restarting it

[dccore01] (dc234_1@dc234oneDomain) admin > pp get file 0000384776BCD2D349CB8CCF1ADF11952C87 dc262_6
Transfer Initiated
[dccore01] (dc234_1@dc234oneDomain) admin > rep ls
0000384776BCD2D349CB8CCF1ADF11952C87 <C-------X--L(0)[0]> 4967844 si={cgamboa:USERS}
re-enable the interface to the bond
[root@dc234 ~]# ip link set dev Internal25G00 up
[root@dc234 ~]# ifstat
#kernel
Interface        RX Pkts/Rate    TX Pkts/Rate    RX Data/Rate    TX Data/Rate  
                 RX Errs/Drop    TX Errs/Drop    RX Over/Rate    TX Coll/Rate  
lo                     0 0             0 0             0 0             0 0      
                       0 0             0 0             0 0             0 0      
Internal25G00      2405M 0         4114M 0         3974M 0       882199K 0      
                       0 3319K         0 0             0 0             0 0      
Internal25G01       8005 0          4302 0         5612K 0         3538K 0      
                       0 0             0 0             0 0             0 0      
bond0              2405M 0      18446744073529M 0      18446744073394M 0       885738K 0      
                       0 3319K         0 0             0 0             0 0      
simulating the other interface failure
22 Aug 2024 10:43:19 (System) [] Client session timed out, have not heard from server in 26681ms for session id 0x10041c40b759013
22 Aug 2024 10:43:19 (System) [] Session 0x10041c40b759013 for server dczoo03.usatlas.bnl.gov/10.42.34.243:2181, Closing socket connection. Attempting reconnect except it is a SessionExpiredException.
org.apache.zookeeper.ClientCnxn$SessionTimeoutException: Client session timed out, have not heard from server in 26681ms for session id 0x10041c40b759013
	at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1251)

The pool remains accessible

[dccore01] (dc234_1@dc234oneDomain) admin > rep ls
0000384776BCD2D349CB8CCF1ADF11952C87 <C-------X--L(0)[0]> 4967844 si={c[gamboa](https://git.racf.bnl.gov/gitea/gamboa):USERS}
gamboa
gamboa commented [4 weeks ago](https://git.racf.bnl.gov/gitea/SDCC/dcache/issues/2#issuecomment-1639)
Poster
Collaborator
Automatic restart of pool due to the zookeeper event

22 Aug 2024 10:43:19 (System) [] Session 0x10041c40b759013 for server dczoo03.usatlas.bnl.gov/10.42.34.243:2181, Closing socket connection. Attempting reconnect except it is a SessionExpiredException.
org.apache.zookeeper.ClientCnxn$SessionTimeoutException: Client session timed out, have not heard from server in 26681ms for session id 0x10041c40b759013
	at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1251)

So for network instabilities the systems appears to be handling accordingly the failover
Will it be simple for dcache to start cataloging this type of errors?

@kofemann
Copy link
Member

@cfgamboa thanks for the detailed update

Will it be simple for dcache to start cataloging this type of errors?

I don't understand the question. What is the expected behavior?

@cfgamboa
Copy link

For my simulation the expected behavior is that pool remains operational without having to be restarted when network degradation occurs. The error observed on the pool might be able to be classified? This could help when other similar errors occur like the one reported by initially in the ticket.

khys95 pushed a commit to khys95/dcache that referenced this issue Oct 7, 2024
Motivation:
ZK logs transient errors at warning log level, thus confusing dCache admins.

```
22 Aug 2024 10:43:19 (System) [] Session 0x10041c40b759013 for server xxxxxxx, Closing socket connection....
org.apache.zookeeper.ClientCnxn$SessionTimeoutException: Client session timed out, have not heard from ....
	at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1251)
```

Modification:
Update logback config to suppress warning messages.

Result:
Less noise in the log files.

Fixes: dCache#7564
Acked-by: Lea Morschel
Target: master
Require-book: no
Require-notes: yes
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants