Infrastructure connectivity checks #29439

Open
31 tasks
PVince81 opened this issue Oct 26, 2021 · 6 comments
Labels
1. to develop Accepted and waiting to be taken care of enhancement feature: settings

Comments

@PVince81
Member

Problem

We receive reports that are not actual bugs but connectivity issues with components, for example slow LDAP servers or unstable connections to the database or Redis.

Solution

Put automated checks in place that make it possible for admins to verify their infrastructure connectivity before reporting issues. The goal is to prevent people from filing issues that are not actual bugs.

Implementation idea

Nextcloud already has a "setup check" section in the settings where various settings are verified.

This section should be extended to include a table with two columns, "connectivity" and "reliability", and with rows representing the various components to be checked.

By connectivity we mean "is it accessible at all", and by reliability we mean "how many connection failures / timeouts occurred in past time intervals (hours, days, weeks)".

Possible components to check that would appear as rows / grouped rows:

  • Database
  • Redis
  • Local filesystem / Primary object store
  • LDAP servers (multiple)
  • external storages (multiple)
  • office document servers
  • ...

Implementation details

Status provider service

  • Provide a new PHP service class in the server for apps to register "connectivity status providers"
    • Status provider method getType(): string for displaying in the table as prefix for a component type (and for grouping)
    • Status provider method getDisplayName(): string for displaying in the table, it must be useful enough for the admin to find out which exact component needs attention
    • Status provider method checkConnectivity(): void that performs an immediate connectivity check against one specific component, for example a specific LDAP server. It only opens the connection; the caller measures the time taken.
    • Status provider method getFailures(): array, returns array of timestamps for last failures.
      • ❓ should we include error messages to be able to aggregate there or at least show it somewhere ?
    • If there are multiple components of the same type, the app/implementor must register multiple status providers, one for each (ex: one provider per LDAP server, but using the same type name)
    • ❓ Make the registration lazy, maybe through an event that we only trigger from the connectivity check controller ?
  • All connectivity status providers will be queried whenever the settings page is loaded
    • For each status provider, call "checkConnectivity()" and measure the time taken to respond and put it in the table
    • For each status provider, call "getFailures()" and summarize the frequency of failures in the table
    • The table needs labels for "absent", "slow" and "ok", assigned according to how long the check took
    • Implement UI + table to show results
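
To make the division of work above concrete, here is a minimal sketch of the provider contract and of a caller that times the check. It is written in TypeScript purely for illustration (the real service would be a PHP class in the server). The names `ConnectivityStatusProvider`, `runCheck` and `SLOW_THRESHOLD_MS` are hypothetical, and so are the 500 ms threshold and the assumption that `checkConnectivity()` throws on failure; the proposal does not specify either.

```typescript
// Hypothetical TypeScript mirror of the proposed PHP status provider contract.
interface ConnectivityStatusProvider {
  getType(): string;         // component type, used as prefix and for grouping
  getDisplayName(): string;  // tells the admin which exact component is meant
  checkConnectivity(): void; // only connects; assumed here to throw on failure
  getFailures(): number[];   // timestamps of the last recorded failures
}

type Verdict = "absent" | "slow" | "ok";

interface CheckResult {
  type: string;
  displayName: string;
  verdict: Verdict;
  elapsedMs: number;
}

// Assumed threshold; the proposal does not define what counts as "slow".
const SLOW_THRESHOLD_MS = 500;

// The caller, not the provider, measures how long the connection attempt took.
// The clock is injectable so the function can be exercised without real delays.
function runCheck(
  provider: ConnectivityStatusProvider,
  now: () => number = Date.now,
): CheckResult {
  const start = now();
  let reachable = true;
  try {
    provider.checkConnectivity();
  } catch {
    reachable = false;
  }
  const elapsedMs = now() - start;
  const verdict: Verdict = !reachable
    ? "absent"
    : elapsedMs > SLOW_THRESHOLD_MS
      ? "slow"
      : "ok";
  return {
    type: provider.getType(),
    displayName: provider.getDisplayName(),
    verdict,
    elapsedMs,
  };
}
```

Keeping the measurement in the caller means every provider is timed the same way, which is one possible answer to the open question about who should be responsible for measuring.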

Intermittent failure tracking

  • Every app that manages components (ex: LDAP) must catch connection failures such as timeouts and report them to the status provider service

    • Status provider service will store the event with timestamp somewhere (database?)
  • Component types (raise tickets when ready)

    • Database: implement in core
    • Redis: implement in core
    • Object Store Primary storage: implement in core
    • LDAP: user_ldap app must return one status provider per LDAP server
    • External storage: files_external must return one status provider per mount
      • ❓ what to do when the list is very long ? aggregate ?
      • ❓ what about personal mounts ?
    • Document server: OnlyOffice / Collabora apps
    • Others ?
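
The failure tracking described at the top of this section, together with the per-interval summary shown in the table, could be sketched roughly as follows. This is an illustration in TypeScript, not the proposed PHP implementation. `FailureLog` is a hypothetical in-memory stand-in for whatever storage the service ends up using (the proposal leaves "database?" open), and the hour/day/week bucket sizes are an assumption derived from the "hours, days, weeks" wording.

```typescript
// Interval lengths in seconds for the "hours, days, weeks" buckets (assumed sizes).
const INTERVAL_SECONDS = { hour: 3600, day: 86400, week: 604800 };
type IntervalName = keyof typeof INTERVAL_SECONDS;

// Hypothetical in-memory stand-in for wherever the service would persist events.
class FailureLog {
  private failures: { [providerId: string]: number[] } = {};

  // Called by an app after it caught a connection failure such as a timeout.
  record(providerId: string, timestampSeconds: number): void {
    if (!this.failures[providerId]) {
      this.failures[providerId] = [];
    }
    this.failures[providerId].push(timestampSeconds);
  }

  // Returns a copy of the recorded failure timestamps for one provider.
  getFailures(providerId: string): number[] {
    return this.failures[providerId] ? this.failures[providerId].slice() : [];
  }
}

// Counts how many failures fall within the last hour, day and week.
function summarizeFailures(
  timestamps: number[],
  nowSeconds: number,
): Record<IntervalName, number> {
  const summary: Record<IntervalName, number> = { hour: 0, day: 0, week: 0 };
  const names = Object.keys(INTERVAL_SECONDS) as IntervalName[];
  for (const ts of timestamps) {
    const age = nowSeconds - ts;
    if (age < 0) continue; // ignore timestamps from the future
    for (const name of names) {
      if (age <= INTERVAL_SECONDS[name]) {
        summary[name] += 1;
      }
    }
  }
  return summary;
}
```

The buckets are cumulative: a failure from ten minutes ago counts towards the hour, day and week columns alike.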

Development phases

  • Phase 1: implement the status provider service in the server. This qualifies as an API change/addition, so it should ideally be released as part of a major release
  • Phase 2 / parallel: implement the status providers for each component; this can be done in parallel and released independently in minor/patch releases

Open issues

  • See "❓" entries in checklist
  • The name "status providers" is not very catchy, any better ideas ?
  • Should tracking and measuring be the responsibility of the apps / implementors or rather of the connectivity service itself ?
  • Expand with more technical clarifications in the concept
@PVince81 PVince81 added enhancement 1. to develop Accepted and waiting to be taken care of labels Oct 26, 2021
@come-nc
Contributor

come-nc commented Oct 26, 2021

How do you handle the fact that checkConnectivity may hang for a "long" time? Are the calls to checkConnectivity made at the same time or one after another, and what does the page show while waiting for the timeouts?
Are the timeouts expected to be handled by each provider as it sees fit? For LDAP I do not see a timeout option in the settings; do the other components have their own timeout in their configuration?

@PVince81
Member Author

How do you handle the fact that checkConnectivity may hang for a "long" time? Are the calls to checkConnectivity made at the same time or one after another, and what does the page show while waiting for the timeouts?

  • revisit concept to include the ability to run parallel checks
    • would require the provider list to be returned to the frontend first
    • frontend would run parallel XHR calls to endpoints for each provider
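
A rough sketch of the parallel variant, assuming the frontend has already received the list of provider ids. `checkAllInParallel` and `SingleCheck` are hypothetical names; the function that performs one check is injected, so no endpoint URL is assumed here. In the browser it would wrap one request per provider against whatever endpoint gets defined.

```typescript
interface ParallelCheckResult {
  providerId: string;
  reachable: boolean;
  elapsedMs: number | null; // null when the check failed
}

// Hypothetical: resolves with the elapsed milliseconds for one provider,
// rejects when the check fails or times out.
type SingleCheck = (providerId: string) => Promise<number>;

// Starts all checks at once; one failing provider does not abort the others.
function checkAllInParallel(
  providerIds: string[],
  checkOne: SingleCheck,
): Promise<ParallelCheckResult[]> {
  const pending = providerIds.map((providerId) =>
    checkOne(providerId).then(
      (elapsedMs): ParallelCheckResult => ({ providerId, reachable: true, elapsedMs }),
      (): ParallelCheckResult => ({ providerId, reachable: false, elapsedMs: null }),
    ),
  );
  return Promise.all(pending);
}
```

Because every promise is given its own rejection handler, a hanging or failing LDAP server only delays its own row; a UI could also render each row as soon as its individual promise settles instead of waiting for the combined result.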

Are the timeouts expected to be handled by each provider as it sees fit? For LDAP I do not see a timeout option in the settings; do the other components have their own timeout in their configuration?

Yes, I'd say components know better how to configure the timeouts for their libraries (ex: Redis), and they need to catch the matching exceptions to track those timeouts.

@blizzz
Member

blizzz commented Oct 26, 2021

For LDAP I do not see a timeout option in the settings, do the other components have their own timeout in configuration?

I will bring one in soon for a currently running case; at the moment I have a simple patch (among others) that needs to be polished.

@blizzz
Member

blizzz commented Oct 26, 2021

External storage: files_external must return one status provider per mount

Shouldn't one per host be sufficient? That is, if a meaningful connection can be made without auth (it should be possible, right?). Otherwise it will be just terrible.

@PVince81
Member Author

External storage: files_external must return one status provider per mount

Shouldn't one per host be sufficient? That is, if a meaningful connection can be made without auth (it should be possible, right?). Otherwise it will be just terrible.

once per host would be nice, but would only work when no auth is set or if the credentials are identical

@blizzz
Member

blizzz commented Oct 28, 2021

once per host would be nice, but would only work when no auth is set or if the credentials are identical

For a connectivity test it should be enough to see whether the server replies reasonably, i.e. by asking for authentication. 30k checks against a service might look like a DoS attempt :)
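
As an illustration of the "one check per host" idea, mounts could be grouped by host before any check is run, so that each host is contacted once no matter how many mounts point at it. This is only a sketch: the `ExternalMount` shape and `groupMountsByHost` are hypothetical and do not reflect the actual files_external data model.

```typescript
// Hypothetical, simplified view of an external storage mount.
interface ExternalMount {
  mountId: number;
  host: string;
}

// Groups mount ids by host (case-insensitively) so each host is checked once.
function groupMountsByHost(mounts: ExternalMount[]): { [host: string]: number[] } {
  const byHost: { [host: string]: number[] } = {};
  for (const mount of mounts) {
    const key = mount.host.toLowerCase();
    if (!byHost[key]) {
      byHost[key] = [];
    }
    byHost[key].push(mount.mountId);
  }
  return byHost;
}
```

The number of connectivity checks then scales with the number of distinct hosts rather than with the number of mounts.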


4 participants