
Can't connect to cluster via NAT with enableReadyCheck option #930

Closed
jeremytm opened this issue Jul 15, 2019 · 4 comments · Fixed by #931

jeremytm commented Jul 15, 2019

This has been happening for a while, but we just set enableReadyCheck (connection option) to false so it's not super urgent.

I have reproduced this both locally and running in Lambda, but only when connecting via a NAT server. It continually fails and calls clusterRetryStrategy, never reaching a ready state.
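For reference, the client is created roughly like this; the addresses and retry strategy below are placeholders rather than our exact configuration:

const Redis = require('ioredis');

// Placeholder setup: the cluster sits behind a NAT instance, so the startup
// node is the NAT'd EC2 hostname rather than the node's internal address.
const cluster = new Redis.Cluster(
  [{ host: 'ec2-XX-X-XX-XX.compute-1.amazonaws.com', port: 6379 }],
  {
    enableReadyCheck: true, // setting this to false avoids the problem
    natMap: {
      // keyed by the ElastiCache DNS name as originally configured (see below)
      'rugby-cache-0001-001.e5di5i.0001.use1.cache.amazonaws.com:6379': {
        host: 'ec2-XX-X-XX-XX.compute-1.amazonaws.com',
        port: 6379,
      },
    },
    clusterRetryStrategy: (times) => {
      console.log('clusterRetryStrategy called', times);
      return Math.min(100 * times, 2000);
    },
  }
);

Debug output: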

15:29:54 2019-07-15T03:29:54.362Z ioredis:cluster status: [empty] -> connecting
15:29:54 2019-07-15T03:29:54.401Z ioredis:cluster resolved hostname ec2-XX-X-XX-XX.compute-1.amazonaws.com to IP 172.30.2.200
15:29:54 2019-07-15T03:29:54.403Z ioredis:cluster:connectionPool Reset with [ { host: '172.30.2.200', port: 6379 } ]
15:29:54 2019-07-15T03:29:54.403Z ioredis:cluster:connectionPool Connecting to 172.30.2.200:6379 as master
15:29:54 2019-07-15T03:29:54.404Z ioredis:redis status[172.30.2.200:6379]: [empty] -> wait
15:29:54 2019-07-15T03:29:54.405Z ioredis:cluster getting slot cache from 172.30.2.200:6379
15:29:54 2019-07-15T03:29:54.421Z ioredis:redis status[172.30.2.200:6379 (ioredisClusterRefresher)]: [empty] -> wait
15:29:54 2019-07-15T03:29:54.423Z ioredis:redis status[172.30.2.200:6379 (ioredisClusterRefresher)]: wait -> connecting
15:29:54 2019-07-15T03:29:54.424Z ioredis:redis queue command[172.30.2.200:6379 (ioredisClusterRefresher)]: 0 -> cluster([ 'slots' ])
15:29:54 2019-07-15T03:29:54.425Z ioredis:cluster:subscriber selected a subscriber 172.30.2.200:6379
15:29:54 2019-07-15T03:29:54.425Z ioredis:redis status[172.30.2.200:6379 (ioredisClusterSubscriber)]: [empty] -> wait
15:29:54 2019-07-15T03:29:54.425Z ioredis:cluster:subscriber started
15:29:54 2019-07-15T03:29:54.462Z ioredis:redis status[172.30.2.200:6379 (ioredisClusterRefresher)]: connecting -> connect
15:29:54 2019-07-15T03:29:54.462Z ioredis:redis status[172.30.2.200:6379 (ioredisClusterRefresher)]: connect -> ready
15:29:54 2019-07-15T03:29:54.462Z ioredis:connection set the connection name [ioredisClusterRefresher]
15:29:54 2019-07-15T03:29:54.462Z ioredis:redis write command[172.30.2.200:6379 (ioredisClusterRefresher)]: 0 -> client([ 'setname', 'ioredisClusterRefresher' ])
15:29:54 2019-07-15T03:29:54.463Z ioredis:connection send 1 commands in offline queue
15:29:54 2019-07-15T03:29:54.464Z ioredis:redis write command[172.30.2.200:6379 (ioredisClusterRefresher)]: 0 -> cluster([ 'slots' ])
15:29:54 2019-07-15T03:29:54.483Z ioredis:cluster cluster slots result count: 1
15:29:54 2019-07-15T03:29:54.483Z ioredis:cluster cluster slots result [0]: slots 0~16383 served by 172.30.100.206:6379,172.30.100.206:6379
15:29:54 2019-07-15T03:29:54.542Z ioredis:cluster:connectionPool Reset with [ { host: '172.30.100.206', port: 6379, readOnly: false },{ host: '172.30.100.206', port: 6379, readOnly: true } ]
15:29:54 2019-07-15T03:29:54.542Z ioredis:cluster:connectionPool Disconnect 172.30.2.200:6379 because the node does not hold any slot
15:29:54 2019-07-15T03:29:54.543Z ioredis:redis status[172.30.2.200:6379]: wait -> close
15:29:54 2019-07-15T03:29:54.543Z ioredis:connection skip reconnecting since the connection is manually closed.
15:29:54 2019-07-15T03:29:54.543Z ioredis:redis status[172.30.2.200:6379]: close -> end
15:29:54 2019-07-15T03:29:54.543Z ioredis:cluster:connectionPool Connecting to 172.30.100.206:6379 as slave
15:29:54 2019-07-15T03:29:54.543Z ioredis:redis status[172.30.100.206:6379]: [empty] -> wait
15:29:54 2019-07-15T03:29:54.544Z ioredis:cluster status: connecting -> connect
15:29:54 2019-07-15T03:29:54.562Z ioredis:cluster Ready check failed (Error: Connection is closed.). Reconnecting...
15:29:54 2019-07-15T03:29:54.563Z ioredis:cluster status: connect -> disconnecting
15:29:54 2019-07-15T03:29:54.563Z ioredis:redis status[172.30.2.200:6379 (ioredisClusterSubscriber)]: wait -> close
15:29:54 2019-07-15T03:29:54.563Z ioredis:connection skip reconnecting since the connection is manually closed.
15:29:54 2019-07-15T03:29:54.563Z ioredis:redis status[172.30.2.200:6379 (ioredisClusterSubscriber)]: close -> end
15:29:54 2019-07-15T03:29:54.563Z ioredis:cluster:subscriber stopped
15:29:54 2019-07-15T03:29:54.563Z ioredis:cluster:connectionPool Reset with []
15:29:54 2019-07-15T03:29:54.563Z ioredis:cluster:connectionPool Disconnect 172.30.2.200:6379 because the node does not hold any slot
15:29:54 2019-07-15T03:29:54.563Z ioredis:cluster:connectionPool Disconnect 172.30.100.206:6379 because the node does not hold any slot
15:29:54 2019-07-15T03:29:54.563Z ioredis:redis status[172.30.100.206:6379]: wait -> close
15:29:54 2019-07-15T03:29:54.564Z ioredis:connection skip reconnecting since the connection is manually closed.
15:29:54 2019-07-15T03:29:54.564Z ioredis:redis status[172.30.100.206:6379]: close -> end
15:29:54 2019-07-15T03:29:54.564Z ioredis:cluster status: disconnecting -> close
15:29:54 2019-07-15T03:29:54.565Z	07cb3dc6-a1af-41d0-ba24-94582a2d7bf8	INFO	clusterRetryStrategy called 1 [ { host: 'ec2-XX-X-XX-XX.compute-1.amazonaws.com', port: 6379 } ]
15:29:54 2019-07-15T03:29:54.565Z ioredis:cluster status: close -> reconnecting
... and so on

I have been able to fix this with a small change to the built ioredis code in cluster/index.js:153, by simply surrounding the call to readyCheck with process.nextTick:

process.nextTick(() => {
    this.readyCheck((err, fail) => {
        if (err || fail) {
            debug("Ready check failed (%s). Reconnecting...", err || fail);
            if (this.status === "connect") {
                this.disconnect(true);
            }
        }
        else {
            readyHandler.call(this);
        }
    });
});

I don't know why this works. I'm hoping this will either help you come up with a better fix, or, if you're happy with this change, I'm happy to submit a PR.

ioredis 4.11.2 (+ previous versions)

luin (Collaborator) commented Jul 15, 2019

There isn't any debug info about NAT mapping (%s -> %O), which is a little strange, since a log line is printed whenever there's a match in the NAT map: https://github.com/luin/ioredis/blob/master/lib/cluster/index.ts#L537.

Another strange thing is that the cluster slots command returns the same node twice:

15:29:54 2019-07-15T03:29:54.483Z ioredis:cluster cluster slots result [0]: slots 0~16383 served by 172.30.100.206:6379,172.30.100.206:6379

That seems to be the problem. Did your NAT mapper map different nodes to the same one?
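For context, the cluster slots reply ioredis receives is an array of [startSlot, endSlot, master, ...replicas] entries; the snippet below is just an illustration of the log line above, where the same address appears as both the master and its replica:

const slotsReply = [
  // [startSlot, endSlot, master, ...replicas]: both node entries point at the same address
  [0, 16383, ['172.30.100.206', 6379], ['172.30.100.206', 6379]],
];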

jeremytm (Author) commented Jul 16, 2019

I fixed the natMap; however, the issue still remains.

natMap:

{
  'rugby-cache-0001-001.e5di5i.0001.use1.cache.amazonaws.com:6379': {
    host: 'ec2-52-4-26-14.compute-1.amazonaws.com',
    port: 6379
  }
}

I added a few debug statements to the natMapper function and confirmed that it uses the IP as the key rather than the hostname.
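To sketch what I mean (my own minimal reconstruction, not the actual ioredis code), the lookup key is the internal ip:port string as reported by cluster slots, so a map keyed by the DNS name never matches:

const natMap = {
  '172.30.100.206:6379': {
    host: 'ec2-52-4-26-14.compute-1.amazonaws.com',
    port: 6379,
  },
};

function natMapper(node) {
  // The key must be the address exactly as CLUSTER SLOTS reports it (internal IP:port)
  const mapped = natMap[`${node.host}:${node.port}`];
  return mapped ? { host: mapped.host, port: mapped.port } : node;
}

natMapper({ host: '172.30.100.206', port: 6379 });
// -> { host: 'ec2-52-4-26-14.compute-1.amazonaws.com', port: 6379 }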

The reason for using the DNS name is as recommended in Amazon ElastiCache FAQs:

Please note that, we strongly recommend you use the DNS Names to connect to your nodes as the underlying IP addresses can change (e.g., after cache node replacement).

In any case, even if I change the natMap to use the IP rather than the hostname, it still continually tries to reconnect in a loop. Can you see anything else I should try?

hosts [ { host: 'ec2-52-4-26-14.compute-1.amazonaws.com', port: 6379 } ]
options { natMap:
   { '172.30.100.206:6379':
      { host: 'ec2-52-4-26-14.compute-1.amazonaws.com', port: 6379 } } }
  ioredis:cluster status: [empty] -> connecting +0ms
  ioredis:cluster resolved hostname ec2-52-4-26-14.compute-1.amazonaws.com to IP 52.4.26.14 +4ms
  ioredis:cluster:connectionPool Reset with [ { host: '52.4.26.14', port: 6379 } ] +0ms
  ioredis:cluster:connectionPool Connecting to 52.4.26.14:6379 as master +0ms
  ioredis:redis status[52.4.26.14:6379]: [empty] -> wait +0ms
  ioredis:cluster getting slot cache from 52.4.26.14:6379 +3ms
  ioredis:redis status[52.4.26.14:6379 (ioredisClusterRefresher)]: [empty] -> wait +1ms
  ioredis:redis status[52.4.26.14:6379 (ioredisClusterRefresher)]: wait -> connecting +1ms
  ioredis:redis queue command[52.4.26.14:6379 (ioredisClusterRefresher)]: 0 -> cluster([ 'slots' ]) +1ms
  ioredis:cluster:subscriber selected a subscriber 52.4.26.14:6379 +0ms
  ioredis:redis status[52.4.26.14:6379 (ioredisClusterSubscriber)]: [empty] -> wait +1ms
  ioredis:cluster:subscriber started +0ms
  ioredis:redis status[52.4.26.14:6379 (ioredisClusterRefresher)]: connecting -> connect +200ms
  ioredis:redis status[52.4.26.14:6379 (ioredisClusterRefresher)]: connect -> ready +0ms
  ioredis:connection set the connection name [ioredisClusterRefresher] +0ms
  ioredis:redis write command[52.4.26.14:6379 (ioredisClusterRefresher)]: 0 -> client([ 'setname', 'ioredisClusterRefresher' ]) +0ms
  ioredis:connection send 1 commands in offline queue +1ms
  ioredis:redis write command[52.4.26.14:6379 (ioredisClusterRefresher)]: 0 -> cluster([ 'slots' ]) +1ms
  ioredis:cluster cluster slots result count: 1 +401ms
  ioredis:cluster NAT mapping 172.30.100.206:6379 -> { host: 'ec2-52-4-26-14.compute-1.amazonaws.com', port: 6379 } +0ms
  ioredis:cluster NAT mapping 172.30.100.206:6379 -> { host: 'ec2-52-4-26-14.compute-1.amazonaws.com', port: 6379 } +0ms
  ioredis:cluster cluster slots result [0]: slots 0~16383 served by ec2-52-4-26-14.compute-1.amazonaws.com:6379,ec2-52-4-26-14.compute-1.amazonaws.com:6379 +0ms
  ioredis:cluster:connectionPool Reset with [ { host: 'ec2-52-4-26-14.compute-1.amazonaws.com',
  ioredis:cluster:connectionPool     port: 6379,
  ioredis:cluster:connectionPool     readOnly: false },
  ioredis:cluster:connectionPool   { host: 'ec2-52-4-26-14.compute-1.amazonaws.com',
  ioredis:cluster:connectionPool     port: 6379,
  ioredis:cluster:connectionPool     readOnly: true } ] +403ms
  ioredis:cluster:connectionPool Disconnect 52.4.26.14:6379 because the node does not hold any slot +0ms
  ioredis:redis status[52.4.26.14:6379]: wait -> close +198ms
  ioredis:connection skip reconnecting since the connection is manually closed. +198ms
  ioredis:redis status[52.4.26.14:6379]: close -> end +0ms
  ioredis:cluster:connectionPool Connecting to ec2-52-4-26-14.compute-1.amazonaws.com:6379 as slave +1ms
  ioredis:redis status[ec2-52-4-26-14.compute-1.amazonaws.com:6379]: [empty] -> wait +0ms
  ioredis:cluster status: connecting -> connect +2ms
  ioredis:cluster Ready check failed (Error: Connection is closed.). Reconnecting... +2ms
  ioredis:cluster status: connect -> disconnecting +0ms
  ioredis:redis status[52.4.26.14:6379 (ioredisClusterSubscriber)]: wait -> close +3ms
  ioredis:connection skip reconnecting since the connection is manually closed. +3ms
  ioredis:redis status[52.4.26.14:6379 (ioredisClusterSubscriber)]: close -> end +0ms
  ioredis:cluster:subscriber stopped +402ms
  ioredis:cluster:connectionPool Reset with [] +3ms
  ioredis:cluster:connectionPool Disconnect 52.4.26.14:6379 because the node does not hold any slot +0ms
  ioredis:cluster:connectionPool Disconnect ec2-52-4-26-14.compute-1.amazonaws.com:6379 because the node does not hold any slot +0ms
  ioredis:redis status[ec2-52-4-26-14.compute-1.amazonaws.com:6379]: wait -> close +0ms
  ioredis:connection skip reconnecting since the connection is manually closed. +0ms
  ioredis:redis status[ec2-52-4-26-14.compute-1.amazonaws.com:6379]: close -> end +0ms
  ioredis:cluster status: disconnecting -> close +1ms
  ioredis:cluster status: close -> reconnecting +1ms
  ioredis:cluster Cluster is disconnected. Retrying after 102ms +104ms
  ioredis:cluster status: reconnecting -> connecting +1ms

It still works fine if I set enableReadyCheck to false.

luin (Collaborator) commented Jul 16, 2019

Thanks for the details. I've spotted the root cause of this issue: the ready check command is sent to the node (52.4.26.14) that is in the middle of disconnecting, instead of to the new node (ec2-52-4-26-14.compute-1.amazonaws.com). Wrapping readyCheck with process.nextTick does the trick because the old node will have been removed from the connection pool by the next tick, so the command will be sent to the new node.
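As a toy illustration (not ioredis internals, just the event-loop ordering the deferral relies on): the pool reset that drops the old node happens later in the same tick, so a synchronous ready check still sees the old node, while a process.nextTick-deferred one runs after the reset has finished. The names below are made up:

const pool = { node: 'old node (52.4.26.14)' };

function onConnect() {
  console.log('sync ready check would target:', pool.node);
  process.nextTick(() => {
    console.log('deferred ready check targets:', pool.node);
  });
}

onConnect();
// The CLUSTER SLOTS handler resets the pool later in the same tick:
pool.node = 'new node (ec2-52-4-26-14.compute-1.amazonaws.com)';

// Output:
//   sync ready check would target: old node (52.4.26.14)
//   deferred ready check targets: new node (ec2-52-4-26-14.compute-1.amazonaws.com)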

I'll create a fix for that this week. For now, you can safely turn enableReadyCheck off.

luin closed this as completed in #931 on Jul 16, 2019
ioredis-robot pushed a commit that referenced this issue Jul 16, 2019
## [4.12.2](v4.12.1...v4.12.2) (2019-07-16)

### Bug Fixes

* **cluster:** prefer master when there're two same node for a slot ([8fb9f97](8fb9f97))
* **cluster:** remove node immediately when slots are redistributed ([ecc13ad](ecc13ad)), closes [#930](#930)
ioredis-robot (Collaborator) commented
🎉 This issue has been resolved in version 4.12.2 🎉

The release is available on:

Your semantic-release bot 📦🚀
