
Can't connect to cluster via NAT with enableReadyCheck option #930

Closed
jeremytm opened this issue Jul 15, 2019 · 4 comments · Fixed by #931

jeremytm commented Jul 15, 2019

This has been happening for a while, but we just set enableReadyCheck (connection option) to false so it's not super urgent.

I have reproduced this both locally and running in Lambda, but only when connecting via a NAT server. It continually fails and calls clusterRetryStrategy, never reaching a ready state.
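For reference, the client is created roughly like this; the addresses and retry strategy below are placeholders rather than our exact configuration:

const Redis = require('ioredis');

// Placeholder setup: the cluster sits behind a NAT instance, so the startup
// node is the NAT'd EC2 hostname rather than the node's internal address.
const cluster = new Redis.Cluster(
  [{ host: 'ec2-XX-X-XX-XX.compute-1.amazonaws.com', port: 6379 }],
  {
    enableReadyCheck: true, // setting this to false avoids the problem
    natMap: {
      // keyed by the ElastiCache DNS name as originally configured (see below)
      'rugby-cache-0001-001.e5di5i.0001.use1.cache.amazonaws.com:6379': {
        host: 'ec2-XX-X-XX-XX.compute-1.amazonaws.com',
        port: 6379,
      },
    },
    clusterRetryStrategy: (times) => {
      console.log('clusterRetryStrategy called', times);
      return Math.min(100 * times, 2000);
    },
  }
);

Debug output: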

15:29:54 2019-07-15T03:29:54.362Z ioredis:cluster status: [empty] -> connecting
15:29:54 2019-07-15T03:29:54.401Z ioredis:cluster resolved hostname ec2-XX-X-XX-XX.compute-1.amazonaws.com to IP 172.30.2.200
15:29:54 2019-07-15T03:29:54.403Z ioredis:cluster:connectionPool Reset with [ { host: '172.30.2.200', port: 6379 } ]
15:29:54 2019-07-15T03:29:54.403Z ioredis:cluster:connectionPool Connecting to 172.30.2.200:6379 as master
15:29:54 2019-07-15T03:29:54.404Z ioredis:redis status[172.30.2.200:6379]: [empty] -> wait
15:29:54 2019-07-15T03:29:54.405Z ioredis:cluster getting slot cache from 172.30.2.200:6379
15:29:54 2019-07-15T03:29:54.421Z ioredis:redis status[172.30.2.200:6379 (ioredisClusterRefresher)]: [empty] -> wait
15:29:54 2019-07-15T03:29:54.423Z ioredis:redis status[172.30.2.200:6379 (ioredisClusterRefresher)]: wait -> connecting
15:29:54 2019-07-15T03:29:54.424Z ioredis:redis queue command[172.30.2.200:6379 (ioredisClusterRefresher)]: 0 -> cluster([ 'slots' ])
15:29:54 2019-07-15T03:29:54.425Z ioredis:cluster:subscriber selected a subscriber 172.30.2.200:6379
15:29:54 2019-07-15T03:29:54.425Z ioredis:redis status[172.30.2.200:6379 (ioredisClusterSubscriber)]: [empty] -> wait
15:29:54 2019-07-15T03:29:54.425Z ioredis:cluster:subscriber started
15:29:54 2019-07-15T03:29:54.462Z ioredis:redis status[172.30.2.200:6379 (ioredisClusterRefresher)]: connecting -> connect
15:29:54 2019-07-15T03:29:54.462Z ioredis:redis status[172.30.2.200:6379 (ioredisClusterRefresher)]: connect -> ready
15:29:54 2019-07-15T03:29:54.462Z ioredis:connection set the connection name [ioredisClusterRefresher]
15:29:54 2019-07-15T03:29:54.462Z ioredis:redis write command[172.30.2.200:6379 (ioredisClusterRefresher)]: 0 -> client([ 'setname', 'ioredisClusterRefresher' ])
15:29:54 2019-07-15T03:29:54.463Z ioredis:connection send 1 commands in offline queue
15:29:54 2019-07-15T03:29:54.464Z ioredis:redis write command[172.30.2.200:6379 (ioredisClusterRefresher)]: 0 -> cluster([ 'slots' ])
15:29:54 2019-07-15T03:29:54.483Z ioredis:cluster cluster slots result count: 1
15:29:54 2019-07-15T03:29:54.483Z ioredis:cluster cluster slots result [0]: slots 0~16383 served by 172.30.100.206:6379,172.30.100.206:6379
15:29:54 2019-07-15T03:29:54.542Z ioredis:cluster:connectionPool Reset with [ { host: '172.30.100.206', port: 6379, readOnly: false },{ host: '172.30.100.206', port: 6379, readOnly: true } ]
15:29:54 2019-07-15T03:29:54.542Z ioredis:cluster:connectionPool Disconnect 172.30.2.200:6379 because the node does not hold any slot
15:29:54 2019-07-15T03:29:54.543Z ioredis:redis status[172.30.2.200:6379]: wait -> close
15:29:54 2019-07-15T03:29:54.543Z ioredis:connection skip reconnecting since the connection is manually closed.
15:29:54 2019-07-15T03:29:54.543Z ioredis:redis status[172.30.2.200:6379]: close -> end
15:29:54 2019-07-15T03:29:54.543Z ioredis:cluster:connectionPool Connecting to 172.30.100.206:6379 as slave
15:29:54 2019-07-15T03:29:54.543Z ioredis:redis status[172.30.100.206:6379]: [empty] -> wait
15:29:54 2019-07-15T03:29:54.544Z ioredis:cluster status: connecting -> connect
15:29:54 2019-07-15T03:29:54.562Z ioredis:cluster Ready check failed (Error: Connection is closed.). Reconnecting...
15:29:54 2019-07-15T03:29:54.563Z ioredis:cluster status: connect -> disconnecting
15:29:54 2019-07-15T03:29:54.563Z ioredis:redis status[172.30.2.200:6379 (ioredisClusterSubscriber)]: wait -> close
15:29:54 2019-07-15T03:29:54.563Z ioredis:connection skip reconnecting since the connection is manually closed.
15:29:54 2019-07-15T03:29:54.563Z ioredis:redis status[172.30.2.200:6379 (ioredisClusterSubscriber)]: close -> end
15:29:54 2019-07-15T03:29:54.563Z ioredis:cluster:subscriber stopped
15:29:54 2019-07-15T03:29:54.563Z ioredis:cluster:connectionPool Reset with []
15:29:54 2019-07-15T03:29:54.563Z ioredis:cluster:connectionPool Disconnect 172.30.2.200:6379 because the node does not hold any slot
15:29:54 2019-07-15T03:29:54.563Z ioredis:cluster:connectionPool Disconnect 172.30.100.206:6379 because the node does not hold any slot
15:29:54 2019-07-15T03:29:54.563Z ioredis:redis status[172.30.100.206:6379]: wait -> close
15:29:54 2019-07-15T03:29:54.564Z ioredis:connection skip reconnecting since the connection is manually closed.
15:29:54 2019-07-15T03:29:54.564Z ioredis:redis status[172.30.100.206:6379]: close -> end
15:29:54 2019-07-15T03:29:54.564Z ioredis:cluster status: disconnecting -> close
15:29:54 2019-07-15T03:29:54.565Z	07cb3dc6-a1af-41d0-ba24-94582a2d7bf8	INFO	clusterRetryStrategy called 1 [ { host: 'ec2-XX-X-XX-XX.compute-1.amazonaws.com', port: 6379 } ]
15:29:54 2019-07-15T03:29:54.565Z ioredis:cluster status: close -> reconnecting
... and so on

I have been able to fix this with a small change to the built ioredis code in cluster/index.js:153, by simply surrounding the call to readyCheck with process.nextTick:

process.nextTick(() => {
    this.readyCheck((err, fail) => {
        if (err || fail) {
            debug("Ready check failed (%s). Reconnecting...", err || fail);
            if (this.status === "connect") {
                this.disconnect(true);
            }
        }
        else {
            readyHandler.call(this);
        }
    });
});

I don't know why this works. I'm hoping this will either help you come up with a better fix, or, if you're happy with this change, I'm happy to submit a PR.

ioredis 4.11.2 (+ previous versions)

luin (Collaborator) commented Jul 15, 2019

There isn't any debug info about NAT mapping (%s -> %O), which is a little strange, since a log line is printed whenever there's a match in the NAT map: https://github.com/luin/ioredis/blob/master/lib/cluster/index.ts#L537.

Another strange thing is that the cluster slots command returns the same node twice:

15:29:54 2019-07-15T03:29:54.483Z ioredis:cluster cluster slots result [0]: slots 0~16383 served by 172.30.100.206:6379,172.30.100.206:6379

That seems to be the problem. Did your NAT mapper map different nodes to the same one?
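For context, the cluster slots reply ioredis receives is an array of [startSlot, endSlot, master, ...replicas] entries; the snippet below is just an illustration of the log line above, where the same address appears as both the master and its replica:

const slotsReply = [
  // [startSlot, endSlot, master, ...replicas]: both node entries point at the same address
  [0, 16383, ['172.30.100.206', 6379], ['172.30.100.206', 6379]],
];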

jeremytm (Author) commented Jul 16, 2019

I fixed the natMap; however, the issue still remains.

natMap:

{
  'rugby-cache-0001-001.e5di5i.0001.use1.cache.amazonaws.com:6379': {
    host: 'ec2-52-4-26-14.compute-1.amazonaws.com',
    port: 6379
  }
}

I added a few debug statements to the natMapper function and confirmed that it uses the IP as the key rather than the hostname.
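To sketch what I mean (my own minimal reconstruction, not the actual ioredis code), the lookup key is the internal ip:port string as reported by cluster slots, so a map keyed by the DNS name never matches:

const natMap = {
  '172.30.100.206:6379': {
    host: 'ec2-52-4-26-14.compute-1.amazonaws.com',
    port: 6379,
  },
};

function natMapper(node) {
  // The key must be the address exactly as CLUSTER SLOTS reports it (internal IP:port)
  const mapped = natMap[`${node.host}:${node.port}`];
  return mapped ? { host: mapped.host, port: mapped.port } : node;
}

natMapper({ host: '172.30.100.206', port: 6379 });
// -> { host: 'ec2-52-4-26-14.compute-1.amazonaws.com', port: 6379 }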

The reason for using the DNS name is as recommended in Amazon ElastiCache FAQs:

Please note that, we strongly recommend you use the DNS Names to connect to your nodes as the underlying IP addresses can change (e.g., after cache node replacement).

In any case, even if I change the natMap to use the IP rather than the hostname, it still continually tries to reconnect in a loop. Can you see anything else I should try?

hosts [ { host: 'ec2-52-4-26-14.compute-1.amazonaws.com', port: 6379 } ]
options { natMap:
   { '172.30.100.206:6379':
      { host: 'ec2-52-4-26-14.compute-1.amazonaws.com', port: 6379 } } }
  ioredis:cluster status: [empty] -> connecting +0ms
  ioredis:cluster resolved hostname ec2-52-4-26-14.compute-1.amazonaws.com to IP 52.4.26.14 +4ms
  ioredis:cluster:connectionPool Reset with [ { host: '52.4.26.14', port: 6379 } ] +0ms
  ioredis:cluster:connectionPool Connecting to 52.4.26.14:6379 as master +0ms
  ioredis:redis status[52.4.26.14:6379]: [empty] -> wait +0ms
  ioredis:cluster getting slot cache from 52.4.26.14:6379 +3ms
  ioredis:redis status[52.4.26.14:6379 (ioredisClusterRefresher)]: [empty] -> wait +1ms
  ioredis:redis status[52.4.26.14:6379 (ioredisClusterRefresher)]: wait -> connecting +1ms
  ioredis:redis queue command[52.4.26.14:6379 (ioredisClusterRefresher)]: 0 -> cluster([ 'slots' ]) +1ms
  ioredis:cluster:subscriber selected a subscriber 52.4.26.14:6379 +0ms
  ioredis:redis status[52.4.26.14:6379 (ioredisClusterSubscriber)]: [empty] -> wait +1ms
  ioredis:cluster:subscriber started +0ms
  ioredis:redis status[52.4.26.14:6379 (ioredisClusterRefresher)]: connecting -> connect +200ms
  ioredis:redis status[52.4.26.14:6379 (ioredisClusterRefresher)]: connect -> ready +0ms
  ioredis:connection set the connection name [ioredisClusterRefresher] +0ms
  ioredis:redis write command[52.4.26.14:6379 (ioredisClusterRefresher)]: 0 -> client([ 'setname', 'ioredisClusterRefresher' ]) +0ms
  ioredis:connection send 1 commands in offline queue +1ms
  ioredis:redis write command[52.4.26.14:6379 (ioredisClusterRefresher)]: 0 -> cluster([ 'slots' ]) +1ms
  ioredis:cluster cluster slots result count: 1 +401ms
  ioredis:cluster NAT mapping 172.30.100.206:6379 -> { host: 'ec2-52-4-26-14.compute-1.amazonaws.com', port: 6379 } +0ms
  ioredis:cluster NAT mapping 172.30.100.206:6379 -> { host: 'ec2-52-4-26-14.compute-1.amazonaws.com', port: 6379 } +0ms
  ioredis:cluster cluster slots result [0]: slots 0~16383 served by ec2-52-4-26-14.compute-1.amazonaws.com:6379,ec2-52-4-26-14.compute-1.amazonaws.com:6379 +0ms
  ioredis:cluster:connectionPool Reset with [ { host: 'ec2-52-4-26-14.compute-1.amazonaws.com',
  ioredis:cluster:connectionPool     port: 6379,
  ioredis:cluster:connectionPool     readOnly: false },
  ioredis:cluster:connectionPool   { host: 'ec2-52-4-26-14.compute-1.amazonaws.com',
  ioredis:cluster:connectionPool     port: 6379,
  ioredis:cluster:connectionPool     readOnly: true } ] +403ms
  ioredis:cluster:connectionPool Disconnect 52.4.26.14:6379 because the node does not hold any slot +0ms
  ioredis:redis status[52.4.26.14:6379]: wait -> close +198ms
  ioredis:connection skip reconnecting since the connection is manually closed. +198ms
  ioredis:redis status[52.4.26.14:6379]: close -> end +0ms
  ioredis:cluster:connectionPool Connecting to ec2-52-4-26-14.compute-1.amazonaws.com:6379 as slave +1ms
  ioredis:redis status[ec2-52-4-26-14.compute-1.amazonaws.com:6379]: [empty] -> wait +0ms
  ioredis:cluster status: connecting -> connect +2ms
  ioredis:cluster Ready check failed (Error: Connection is closed.). Reconnecting... +2ms
  ioredis:cluster status: connect -> disconnecting +0ms
  ioredis:redis status[52.4.26.14:6379 (ioredisClusterSubscriber)]: wait -> close +3ms
  ioredis:connection skip reconnecting since the connection is manually closed. +3ms
  ioredis:redis status[52.4.26.14:6379 (ioredisClusterSubscriber)]: close -> end +0ms
  ioredis:cluster:subscriber stopped +402ms
  ioredis:cluster:connectionPool Reset with [] +3ms
  ioredis:cluster:connectionPool Disconnect 52.4.26.14:6379 because the node does not hold any slot +0ms
  ioredis:cluster:connectionPool Disconnect ec2-52-4-26-14.compute-1.amazonaws.com:6379 because the node does not hold any slot +0ms
  ioredis:redis status[ec2-52-4-26-14.compute-1.amazonaws.com:6379]: wait -> close +0ms
  ioredis:connection skip reconnecting since the connection is manually closed. +0ms
  ioredis:redis status[ec2-52-4-26-14.compute-1.amazonaws.com:6379]: close -> end +0ms
  ioredis:cluster status: disconnecting -> close +1ms
  ioredis:cluster status: close -> reconnecting +1ms
  ioredis:cluster Cluster is disconnected. Retrying after 102ms +104ms
  ioredis:cluster status: reconnecting -> connecting +1ms

It still works fine if I set enableReadyCheck to false.

luin (Collaborator) commented Jul 16, 2019

Thanks for the details. I've spotted the root cause of this issue: the ready check command is sent to the node (52.4.26.14) that is in the middle of disconnecting, instead of to the new node (ec2-52-4-26-14.compute-1.amazonaws.com). Wrapping readyCheck with process.nextTick does the trick because the old node will have been removed from the connection pool by the next tick, so the command will be sent to the new node.
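As a toy illustration (not ioredis internals, just the event-loop ordering the deferral relies on): the pool reset that drops the old node happens later in the same tick, so a synchronous ready check still sees the old node, while a process.nextTick-deferred one runs after the reset has finished. The names below are made up:

const pool = { node: 'old node (52.4.26.14)' };

function onConnect() {
  console.log('sync ready check would target:', pool.node);
  process.nextTick(() => {
    console.log('deferred ready check targets:', pool.node);
  });
}

onConnect();
// The CLUSTER SLOTS handler resets the pool later in the same tick:
pool.node = 'new node (ec2-52-4-26-14.compute-1.amazonaws.com)';

// Output:
//   sync ready check would target: old node (52.4.26.14)
//   deferred ready check targets: new node (ec2-52-4-26-14.compute-1.amazonaws.com)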

I'll create a fix for that this week. For now, you can safely turn enableReadyCheck off.

luin closed this as completed in #931 on Jul 16, 2019
ioredis-robot pushed a commit that referenced this issue Jul 16, 2019
## [4.12.2](v4.12.1...v4.12.2) (2019-07-16)

### Bug Fixes

* **cluster:** prefer master when there're two same node for a slot ([8fb9f97](8fb9f97))
* **cluster:** remove node immediately when slots are redistributed ([ecc13ad](ecc13ad)), closes [#930](#930)
ioredis-robot (Collaborator) commented
🎉 This issue has been resolved in version 4.12.2 🎉

The release is available on:

Your semantic-release bot 📦🚀
