Add retry mechanism to handle connection reset by peer for Conjur reads #63

diverdane · 2020-03-23T13:13:33Z

Is your feature request related to a problem? Please describe.

In a customer support case, the PCF conjur-env binary used to retrieve secrets reported "read: connection reset by peer" errors. As a result, the customer application read an empty string for a port environment variable.

The connection resets appeared to come from the network rather than from Conjur itself: the follower logs showed no errors and no sign that a request for the port variable was ever received. The best guess is that the resets originate somewhere in the customer topology, e.g. a load balancer. Whatever sent the resets was presumably experiencing CPU or buffer overload, perhaps because of bursty network traffic or a large number of concurrent messages being received (reference: golang/go#20960).

Since these errors occurred in only one setup in the customer's network (1 out of 100 or more), and were intermittent in that setup (happening ~50% of the time), a retry mechanism on the Conjur API side would help recover gracefully in cases like this.

It would also help to have a limit on the number of concurrent messages being sent by the Conjur API.

The proposal is to add a retry mechanism to the Conjur Golang API for network errors such as ECONNRESET or ECONNABORTED. A PoC has been created by @doodlesbykumbi: https://github.com/doodlesbykumbi/cloudfoundry-conjur-buildpack/pull/1/files

Ideally, time permitting, the retry parameters would be configurable, e.g. in a secretless.yml configuration.

Describe the solution you would like

Queries of Conjur variables get retried for ECONNRESET or ECONNABORTED errors. The character of the retries can be:

Minimum: Always enabled, fixed number of retries, fixed delay between retry attempts
Nice-to-Have: Enabling, number of retries, and retry delay configurable in a secretless.yml file.
Nice-to-Have: Limits on concurrency configurable either in secretless.yml or in a Conjur variable.

diverdane added component/api/go kind/enhancement labels Mar 23, 2020

sgnn7 added team/community-and-integrations severity/low good-first-issue and removed team/community-and-integrations labels Mar 23, 2020

izgeri added the source/salesforce label Mar 31, 2020
