This repository has been archived by the owner on Sep 17, 2024. It is now read-only.

No datastreams are listed on Fleet #1274

Closed
mdelapenya opened this issue Jun 21, 2021 · 11 comments
Labels
area:test Anything related to the Test automation bug Something isn't working impact:critical Immediate priority; high value or cost to the product. priority:blocker Work is on-hold for a product team, business is at risk until resolution of issue requested-by:Fleet size:S less than 1 day Team:Automation Label for the Observability productivity team Team:Fleet Label for the Fleet team triaged Triaged issues will end up in Backlog column in Robots GH Project

Comments

@mdelapenya
Contributor

The step "system package dashboards are listed in Fleet" is failing because it does not find any data stream within the maximum timeout (3 min).

It fails on Centos and Debian, in both AMD and ARM.

Steps to reproduce

1. Run:

$ TAGS="fleet_mode_agent && install && centos" TIMEOUT_FACTOR=3 LOG_LEVEL=TRACE DEVELOPER_MODE=true ELASTIC_APM_ACTIVE=false make -C e2e/_suites/fleet functional-test

Expected behaviour: the scenario passes
Current behaviour:
--- Failed steps:

  Scenario Outline: Deploying the centos agent # features/fleet_mode_agent.feature:6
    And system package dashboards are listed in Fleet # features/fleet_mode_agent.feature:10
      Error: There are no datastreams yet


1 scenarios (1 failed)
4 steps (3 passed, 1 failed)
5m22.255990276s
make: *** [functional-test] Error 1
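
The failure above is a poll that gives up once its deadline passes. As a rough illustration of that pattern only (this is not the e2e suite's own retry code, and `wait_until` is a hypothetical helper name), a minimal POSIX shell sketch could look like this:

```shell
# Hypothetical helper: run a check command repeatedly until it succeeds
# or the overall timeout (in seconds) expires.
wait_until() {
  timeout_s=$1    # overall budget in seconds
  interval_s=$2   # pause between attempts in seconds
  shift 2
  deadline=$(( $(date +%s) + timeout_s ))
  while :; do
    # "$@" is the check command; exit status 0 means the condition holds.
    if "$@"; then
      return 0
    fi
    if [ "$(date +%s)" -ge "$deadline" ]; then
      echo "timed out after ${timeout_s}s waiting for: $*" >&2
      return 1
    fi
    sleep "$interval_s"
  done
}
```

For example, `wait_until 180 5 has_datastreams` would retry every 5 seconds for up to 3 minutes, where `has_datastreams` stands for whatever check you supply (also hypothetical here).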
@mdelapenya mdelapenya added bug Something isn't working Team:Automation Label for the Observability productivity team Team:Fleet Label for the Fleet team area:test Anything related to the Test automation size:S less than 1 day triaged Triaged issues will end up in Backlog column in Robots GH Project requested-by:Fleet impact:critical Immediate priority; high value or cost to the product. labels Jun 21, 2021
@mdelapenya mdelapenya self-assigned this Jun 21, 2021
@mdelapenya mdelapenya added the priority:blocker Work is on-hold for a product team, business is at risk until resolution of issue label Jun 21, 2021
@mdelapenya
Contributor Author

mdelapenya commented Jun 21, 2021

I'm currently bisecting the test execution across four Kibana commits.

Screenshot 2021-06-21 at 17 45 17

Screenshot 2021-06-21 at 18 21 30

I've built and pushed Kibana images for those commits for AMD only, because the APM-CI job builds only the AMD image. As a result, the CI builds below will show errors in all ARM stages. That is fine here: the goal is to check whether a given image breaks the tests, regardless of architecture.

❌ 1st commit: 4a941565502547f96bab72786e1ac11f61f19558 (elastic/kibana#101828)

TAGS="fleet_mode_agent && install && centos" TIMEOUT_FACTOR=3 LOG_LEVEL=TRACE KIBANA_VERSION=pr101828 DEVELOPER_MODE=true make -C e2e/_suites/fleet functional-test

UPDATE: the Revoke token scenario also fails with this image, verified locally with:

TAGS="fleet_mode_agent && revoke-token && centos" TIMEOUT_FACTOR=3 LOG_LEVEL=TRACE DEVELOPER_MODE=true KIBANA_VERSION=pr101828 ELASTIC_APM_ACTIVE=false make -C e2e/_suites/fleet functional-test

❌ 2nd commit: cd5cd65fb2ec04ed63fcbc6b87f1fdb7333bee72 (elastic/kibana#102219)

TAGS="fleet_mode_agent && install && centos" TIMEOUT_FACTOR=3 LOG_LEVEL=TRACE KIBANA_VERSION=pr102219 DEVELOPER_MODE=true make -C e2e/_suites/fleet functional-test

UPDATE: the Revoke token scenario also fails with this image, verified locally with:

TAGS="fleet_mode_agent && revoke-token && centos" TIMEOUT_FACTOR=3 LOG_LEVEL=TRACE DEVELOPER_MODE=true KIBANA_VERSION=pr102219 ELASTIC_APM_ACTIVE=false make -C e2e/_suites/fleet functional-test

3rd commit: 35cc59b571d19fe52eff17777a4613fd867ff928 (elastic/kibana#101835)

  • CI build: Not available yet, as the CI is building the image ATM
  • Run locally: (not possible until the CI builds the image)
TAGS="fleet_mode_agent && install && centos" TIMEOUT_FACTOR=3 LOG_LEVEL=TRACE KIBANA_VERSION=pr101835 DEVELOPER_MODE=true make -C e2e/_suites/fleet functional-test

4th commit: 6df58dd7ca53b43c2f143823ebbe51083618032b (elastic/kibana#101752)

  • CI build: Not available yet, as the CI is building the image ATM
  • Run locally: (not possible until the CI builds the image)
TAGS="fleet_mode_agent && install && centos" TIMEOUT_FACTOR=3 LOG_LEVEL=TRACE KIBANA_VERSION=pr101752 DEVELOPER_MODE=true make -C e2e/_suites/fleet functional-test

Will post results here.
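
The four invocations above differ only in the `KIBANA_VERSION` tag, so they can be generated from the list of PR numbers. The sketch below only prints the commands rather than running them; `kibana_test_cmd` is a hypothetical helper name, and the variables are copied from the commands quoted in this comment:

```shell
# Print (not run) the functional-test invocation for one Kibana PR image.
kibana_test_cmd() {
  pr=$1
  printf '%s\n' "TAGS=\"fleet_mode_agent && install && centos\" TIMEOUT_FACTOR=3 LOG_LEVEL=TRACE KIBANA_VERSION=pr${pr} DEVELOPER_MODE=true make -C e2e/_suites/fleet functional-test"
}

# The four PR numbers listed in this comment, in the order they were posted.
for pr in 101828 102219 101835 101752; do
  kibana_test_cmd "$pr"
done
```

Printing first makes it easy to review the commands; piping the output to `sh` from the repository root would then execute them one after another.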

@mdelapenya
Contributor Author

mdelapenya commented Jun 21, 2021

It's weird: I've tested with an old Kibana image (pr101655, from June 8th 2021) and the test still fails. I think the error is not in Kibana but in one of the other pieces: fleet-server or the agent. I cannot run the tests for incremental commits of the agent because the artifacts are not being generated in the GCP bucket.

I'm going to bisect the elastic-agent image locally, updating the fleet-server agent, to see whether the problem comes from there:

docker pull docker.elastic.co/observability-ci/elastic-agent:pr-26260-amd64
docker tag docker.elastic.co/observability-ci/elastic-agent:pr-26260-amd64  docker.elastic.co/observability-ci/elastic-agent:8.0.0-SNAPSHOT
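
When repeating this for several agent PRs, the pull-and-retag pair can be wrapped in a small function. This is only a sketch: `retag_agent_image` and the `DOCKER` override are hypothetical names introduced here, and by default the function performs a dry run that just prints the two commands.

```shell
# Sketch: pull a PR-built elastic-agent image and re-tag it as the snapshot
# version used above. DOCKER defaults to a dry run ("echo docker");
# set DOCKER=docker to execute the commands for real.
retag_agent_image() {
  pr=$1
  arch=${2:-amd64}
  repo=docker.elastic.co/observability-ci/elastic-agent
  # Left unquoted on purpose so "echo docker" splits into command + argument.
  ${DOCKER:-echo docker} pull "${repo}:pr-${pr}-${arch}" &&
  ${DOCKER:-echo docker} tag "${repo}:pr-${pr}-${arch}" "${repo}:8.0.0-SNAPSHOT"
}

retag_agent_image 26260
```

With `DOCKER` unset this prints the same two `docker` commands shown above, which is a cheap way to check the tags before touching local images.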

UPDATE: It is difficult to bisect, because fleet-server tries to verify the binaries it is installing locally, and that verification fails. This is the log output of the fleet-server:

Performing setup of Fleet in Kibana

Policy selected for enrollment:  
The Elastic Agent is currently in BETA and should not be used in production
2021-06-21T16:50:31.427Z        INFO    cmd/enroll_cmd.go:469   Spawning Elastic Agent daemon as a subprocess to complete bootstrap process.
2021-06-21T16:50:31.570Z        INFO    warn/warn.go:18 The Elastic Agent is currently in BETA and should not be used in production
2021-06-21T16:50:31.570Z        INFO    application/application.go:68   Detecting execution mode
2021-06-21T16:50:31.571Z        INFO    application/application.go:89   Agent is in Fleet Server bootstrap mode
2021-06-21T16:50:31.914Z        INFO    [api]   api/server.go:62        Starting stats endpoint
2021-06-21T16:50:31.914Z        INFO    application/fleet_server_bootstrap.go:124       Agent is starting
2021-06-21T16:50:31.914Z        INFO    [api]   api/server.go:64        Metrics endpoint listening on: /usr/share/elastic-agent/state/data/tmp/elastic-agent.sock (configured: unix:///usr/share/elastic-agent/state/data/tmp/elastic-agent.sock)
2021-06-21T16:50:31.916Z        INFO    application/fleet_server_bootstrap.go:134       Agent is stopped
2021-06-21T16:50:32.431Z        INFO    cmd/enroll_cmd.go:611   Waiting for Elastic Agent to start Fleet Server
2021-06-21T16:50:32.633Z        INFO    stateresolver/stateresolver.go:48       New State ID is V87_qo-m
2021-06-21T16:50:32.633Z        INFO    stateresolver/stateresolver.go:49       Converging state requires execution of 1 step(s)
2021-06-21T16:50:35.243Z        INFO    operation/operation_fetch.go:75 downloaded binary 'fleet-server.8.0.0-SNAPSHOT' into '/usr/share/elastic-agent/state/data/downloads/fleet-server-8.0.0-SNAPSHOT-linux-x86_64.tar.gz' as part of operation 'operation-fetch'
2021-06-21T16:50:36.415Z        INFO    log/reporter.go:40      2021-06-21T16:50:36Z - message: Application: fleet-server--8.0.0-SNAPSHOT[]: State changed to STARTING: Starting - type: 'STATE' - sub_type: 'STARTING'
2021-06-21T16:50:36.417Z        INFO    stateresolver/stateresolver.go:66       Updating internal state
2021-06-21T16:50:36.443Z        INFO    cmd/enroll_cmd.go:644   Fleet Server - Starting
2021-06-21T16:50:37.950Z        WARN    status/reporter.go:236  Elastic Agent status changed to: 'degraded'
2021-06-21T16:50:37.950Z        INFO    log/reporter.go:40      2021-06-21T16:50:37Z - message: Application: fleet-server--8.0.0-SNAPSHOT[]: State changed to DEGRADED: Running on default policy with Fleet Server integration; missing config fleet.agent.id (expected during bootstrap process) - type: 'STATE' - sub_type: 'RUNNING'
2021-06-21T16:50:38.448Z        INFO    cmd/enroll_cmd.go:625   Fleet Server - Running on default policy with Fleet Server integration; missing config fleet.agent.id (expected during bootstrap process)
2021-06-21T16:50:39.312Z        INFO    cmd/enroll_cmd.go:207   Elastic Agent has been enrolled; start Elastic Agent
2021-06-21T16:50:39.312Z        INFO    cmd/run.go:189  Shutting down Elastic Agent and sending last events...
2021-06-21T16:50:39.312Z        INFO    operation/operator.go:191       waiting for installer of pipeline 'default' to finish
2021-06-21T16:50:39.312Z        INFO    process/app.go:181      Signaling application to stop because of shutdown: fleet-server--8.0.0-SNAPSHOT
2021-06-21T16:50:39.813Z        INFO    status/reporter.go:236  Elastic Agent status changed to: 'online'
2021-06-21T16:50:39.814Z        INFO    cmd/run.go:197  Shutting down completed.
2021-06-21T16:50:39.814Z        INFO    log/reporter.go:40      2021-06-21T16:50:39Z - message: Application: fleet-server--8.0.0-SNAPSHOT[]: State changed to STOPPED: Stopped - type: 'STATE' - sub_type: 'STOPPED'
2021-06-21T16:50:39.814Z        INFO    [api]   api/server.go:66        Stats endpoint (/usr/share/elastic-agent/state/data/tmp/elastic-agent.sock) finished: accept unix /usr/share/elastic-agent/state/data/tmp/elastic-agent.sock: use of closed network connection
Successfully enrolled the Elastic Agent.
2021-06-21T16:50:39.929Z        INFO    warn/warn.go:18 The Elastic Agent is currently in BETA and should not be used in production
2021-06-21T16:50:39.929Z        INFO    application/application.go:68   Detecting execution mode
2021-06-21T16:50:39.930Z        INFO    application/application.go:93   Agent is managed by Fleet
2021-06-21T16:50:39.930Z        INFO    capabilities/capabilities.go:59 capabilities file not found in /usr/share/elastic-agent/state/capabilities.yml
2021-06-21T16:50:40.006Z        INFO    [composable]    composable/controller.go:46     EXPERIMENTAL - Inputs with variables are currently experimental and should not be used in production
2021-06-21T16:50:40.110Z        INFO    [composable.providers.docker]   docker/docker.go:43     Docker provider skipped, unable to connect: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
2021-06-21T16:50:40.111Z        INFO    [api]   api/server.go:62        Starting stats endpoint
2021-06-21T16:50:40.111Z        INFO    application/managed_mode.go:290 Agent is starting
2021-06-21T16:50:40.111Z        INFO    [api]   api/server.go:64        Metrics endpoint listening on: /usr/share/elastic-agent/state/data/tmp/elastic-agent.sock (configured: unix:///usr/share/elastic-agent/state/data/tmp/elastic-agent.sock)
2021-06-21T16:50:40.214Z        WARN    application/managed_mode.go:303 failed to ack update open /usr/share/elastic-agent/state/data/.update-marker: no such file or directory
2021-06-21T16:50:40.843Z        INFO    stateresolver/stateresolver.go:48       New State ID is GZd1I8Eu
2021-06-21T16:50:40.843Z        INFO    stateresolver/stateresolver.go:49       Converging state requires execution of 2 step(s)
2021-06-21T16:50:41.426Z        INFO    operation/operator.go:259       operation 'operation-install' skipped for fleet-server.8.0.0-SNAPSHOT
2021-06-21T16:50:41.531Z        INFO    log/reporter.go:40      2021-06-21T16:50:41Z - message: Application: fleet-server--8.0.0-SNAPSHOT[69f22191-30a0-4f0a-b54e-eaab826d4a87]: State changed to STARTING: Starting - type: 'STATE' - sub_type: 'STARTING'
2021-06-21T16:50:42.563Z        INFO    log/reporter.go:40      2021-06-21T16:50:42Z - message: Application: fleet-server--8.0.0-SNAPSHOT[69f22191-30a0-4f0a-b54e-eaab826d4a87]: State changed to RUNNING: Running on default policy with Fleet Server integration - type: 'STATE' - sub_type: 'RUNNING'
2021-06-21T16:50:42.598Z        ERROR   log/reporter.go:36      2021-06-21T16:50:42Z - message: Application: filebeat--8.0.0-SNAPSHOT--36643631373035623733363936343635[69f22191-30a0-4f0a-b54e-eaab826d4a87]: State changed to FAILED: operation 'operation-verify' failed to verify filebeat.8.0.0-SNAPSHOT: 3 errors occurred:
        * fetching asc file from '/usr/share/elastic-agent/state/data/downloads/filebeat-8.0.0-SNAPSHOT-linux-x86_64.tar.gz.asc': open /usr/share/elastic-agent/state/data/downloads/filebeat-8.0.0-SNAPSHOT-linux-x86_64.tar.gz.asc: no such file or directory
        * check detached signature: openpgp: invalid signature: hash tag doesn't match
        * fetching asc file from https://artifacts.elastic.co/downloads/beats/filebeat/filebeat-8.0.0-SNAPSHOT-linux-x86_64.tar.gz.asc: call to 'https://artifacts.elastic.co/downloads/beats/filebeat/filebeat-8.0.0-SNAPSHOT-linux-x86_64.tar.gz.asc' returned unsuccessful status code: 404

 - type: 'ERROR' - sub_type: 'FAILED'
2021-06-21T16:50:42.598Z        ERROR   status/reporter.go:236  Elastic Agent status changed to: 'error'
2021-06-21T16:50:42.598Z        ERROR   operation/operation_retryable.go:85     operation operation-verify failed, err: operation 'operation-verify' failed to verify filebeat.8.0.0-SNAPSHOT: 3 errors occurred:
        * fetching asc file from '/usr/share/elastic-agent/state/data/downloads/filebeat-8.0.0-SNAPSHOT-linux-x86_64.tar.gz.asc': open /usr/share/elastic-agent/state/data/downloads/filebeat-8.0.0-SNAPSHOT-linux-x86_64.tar.gz.asc: no such file or directory
        * check detached signature: openpgp: invalid signature: hash tag doesn't match
        * fetching asc file from https://artifacts.elastic.co/downloads/beats/filebeat/filebeat-8.0.0-SNAPSHOT-linux-x86_64.tar.gz.asc: call to 'https://artifacts.elastic.co/downloads/beats/filebeat/filebeat-8.0.0-SNAPSHOT-linux-x86_64.tar.gz.asc' returned unsuccessful status code: 404


2021-06-21T16:50:43.664Z        ERROR   operation/operation_retryable.go:85     operation operation-verify failed, err: operation 'operation-verify' failed to verify metricbeat.8.0.0-SNAPSHOT: 3 errors occurred:
        * fetching asc file from '/usr/share/elastic-agent/state/data/downloads/metricbeat-8.0.0-SNAPSHOT-linux-x86_64.tar.gz.asc': open /usr/share/elastic-agent/state/data/downloads/metricbeat-8.0.0-SNAPSHOT-linux-x86_64.tar.gz.asc: no such file or directory
        * check detached signature: openpgp: invalid signature: hash tag doesn't match
        * fetching asc file from https://artifacts.elastic.co/downloads/beats/metricbeat/metricbeat-8.0.0-SNAPSHOT-linux-x86_64.tar.gz.asc: call to 'https://artifacts.elastic.co/downloads/beats/metricbeat/metricbeat-8.0.0-SNAPSHOT-linux-x86_64.tar.gz.asc' returned unsuccessful status code: 404


2021-06-21T16:50:43.664Z        INFO    [api]   api/server.go:66        Stats endpoint (/usr/share/elastic-agent/state/data/tmp/elastic-agent.sock) finished: accept unix /usr/share/elastic-agent/state/data/tmp/elastic-agent.sock: use of closed network connection
Error: operator: failed to execute step sc-run, error: 2 errors occurred:
        * operation 'operation-verify' failed to verify metricbeat.8.0.0-SNAPSHOT: 3 errors occurred:
        * fetching asc file from '/usr/share/elastic-agent/state/data/downloads/metricbeat-8.0.0-SNAPSHOT-linux-x86_64.tar.gz.asc': open /usr/share/elastic-agent/state/data/downloads/metricbeat-8.0.0-SNAPSHOT-linux-x86_64.tar.gz.asc: no such file or directory
        * check detached signature: openpgp: invalid signature: hash tag doesn't match
        * fetching asc file from https://artifacts.elastic.co/downloads/beats/metricbeat/metricbeat-8.0.0-SNAPSHOT-linux-x86_64.tar.gz.asc: call to 'https://artifacts.elastic.co/downloads/beats/metricbeat/metricbeat-8.0.0-SNAPSHOT-linux-x86_64.tar.gz.asc' returned unsuccessful status code: 404


        * operation 'operation-verify' failed to verify metricbeat.8.0.0-SNAPSHOT: 3 errors occurred:
        * fetching asc file from '/usr/share/elastic-agent/state/data/downloads/metricbeat-8.0.0-SNAPSHOT-linux-x86_64.tar.gz.asc': open /usr/share/elastic-agent/state/data/downloads/metricbeat-8.0.0-SNAPSHOT-linux-x86_64.tar.gz.asc: no such file or directory
        * check detached signature: openpgp: invalid signature: hash tag doesn't match
        * fetching asc file from https://artifacts.elastic.co/downloads/beats/metricbeat/metricbeat-8.0.0-SNAPSHOT-linux-x86_64.tar.gz.asc: call to 'https://artifacts.elastic.co/downloads/beats/metricbeat/metricbeat-8.0.0-SNAPSHOT-linux-x86_64.tar.gz.asc' returned unsuccessful status code: 404



: 2 errors occurred:
        * operation 'operation-verify' failed to verify metricbeat.8.0.0-SNAPSHOT: 3 errors occurred:
        * fetching asc file from '/usr/share/elastic-agent/state/data/downloads/metricbeat-8.0.0-SNAPSHOT-linux-x86_64.tar.gz.asc': open /usr/share/elastic-agent/state/data/downloads/metricbeat-8.0.0-SNAPSHOT-linux-x86_64.tar.gz.asc: no such file or directory
        * check detached signature: openpgp: invalid signature: hash tag doesn't match
        * fetching asc file from https://artifacts.elastic.co/downloads/beats/metricbeat/metricbeat-8.0.0-SNAPSHOT-linux-x86_64.tar.gz.asc: call to 'https://artifacts.elastic.co/downloads/beats/metricbeat/metricbeat-8.0.0-SNAPSHOT-linux-x86_64.tar.gz.asc' returned unsuccessful status code: 404


        * operation 'operation-verify' failed to verify metricbeat.8.0.0-SNAPSHOT: 3 errors occurred:
        * fetching asc file from '/usr/share/elastic-agent/state/data/downloads/metricbeat-8.0.0-SNAPSHOT-linux-x86_64.tar.gz.asc': open /usr/share/elastic-agent/state/data/downloads/metricbeat-8.0.0-SNAPSHOT-linux-x86_64.tar.gz.asc: no such file or directory
        * check detached signature: openpgp: invalid signature: hash tag doesn't match
        * fetching asc file from https://artifacts.elastic.co/downloads/beats/metricbeat/metricbeat-8.0.0-SNAPSHOT-linux-x86_64.tar.gz.asc: call to 'https://artifacts.elastic.co/downloads/beats/metricbeat/metricbeat-8.0.0-SNAPSHOT-linux-x86_64.tar.gz.asc' returned unsuccessful status code: 404
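
The log above repeats the same verification error many times. To see at a glance which artifacts failed `operation-verify`, the log can be reduced to a unique list. This is a minimal sketch; `failed_verifications` is a hypothetical helper name, and the sample input is trimmed from the log lines above:

```shell
# List each artifact named in an "operation-verify" failure, once.
# Reads agent log text on stdin.
failed_verifications() {
  sed -n 's/.*failed to verify \([^:]*\):.*/\1/p' | sort -u
}

# Example with lines trimmed from the log above.
failed_verifications <<'EOF'
State changed to FAILED: operation 'operation-verify' failed to verify filebeat.8.0.0-SNAPSHOT: 3 errors occurred:
        * operation 'operation-verify' failed to verify metricbeat.8.0.0-SNAPSHOT: 3 errors occurred:
        * operation 'operation-verify' failed to verify metricbeat.8.0.0-SNAPSHOT: 3 errors occurred:
EOF
```

On the sample input this yields `filebeat.8.0.0-SNAPSHOT` and `metricbeat.8.0.0-SNAPSHOT`, the two artifacts whose signature check failed in this run.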

@mdelapenya
Contributor Author

Hmm, I retried the previous step with a clean environment, removing all the services in the compose file, and now fleet-server starts properly.

But unfortunately the revoke test still fails:

--- Failed steps:

  Scenario Outline: Revoking the enrollment token for the centos agent # features/fleet_mode_agent.feature:105
    Then an attempt to enroll a new agent fails # features/fleet_mode_agent.feature:108
      Error: The agent was enrolled although the token was previously revoked

So if the fleet-server is from 8 days ago, when the tests were supposed to be passing, and the tests still fail, I'd say the cause is another piece of the stack: it doesn't seem to be Kibana, and it doesn't seem to be fleet-server. Let's check the agent. I'm going to bisect the agent, although I'm seeing problems with the packaging job, which is not producing artifacts for every commit. cc @elastic/observablt-robots

@adam-stokes
Contributor

Could this be an Elasticsearch change? Maybe the query has changed?

@adam-stokes adam-stokes self-assigned this Jun 22, 2021
@adam-stokes
Contributor

This is due to the way we currently bring up Kibana: the environment variable XPACK_FLEET_AGENTS_FLEET_SERVER_HOSTS does not seem to be honored properly. That is the same reason #1273 is failing. I will post a reference bug once one is available.
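
For context, the Kibana Docker image documents a naming convention in which a `kibana.yml` setting is passed as an environment variable written in capitals with underscores as separators. The sketch below derives the variable name from a setting name under that convention. The setting name `xpack.fleet.agents.fleet_server.hosts` is inferred from the variable mentioned above, and `kibana_env_name` is a hypothetical helper:

```shell
# Derive the environment variable name for a kibana.yml setting:
# upper-case it and turn dots into underscores.
kibana_env_name() {
  printf '%s\n' "$1" | tr '[:lower:]' '[:upper:]' | tr '.' '_'
}

setting=xpack.fleet.agents.fleet_server.hosts   # assumed setting name
var=$(kibana_env_name "$setting")
echo "$setting -> $var"

# Report whether that variable is present in the current environment.
if env | grep -q "^${var}="; then
  echo "$var is set"
else
  echo "$var is not set"
fi
```

Running the same `env` check inside the Kibana container would show whether the variable reaches the process at all; whether Kibana then honors it is a separate question.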

@cachedout
Contributor

@adam-stokes and @mdelapenya did #1281 fix this as well?

@mdelapenya
Contributor Author

> @adam-stokes and @mdelapenya did #1281 fix this as well?

No, that was not a solution, and this issue is still under investigation. We'll post further findings here.

@EricDavisX
Contributor

I logged a product team bug for this; thanks so much for noting it in Slack, Manu. It definitely sounds like the same issue:
elastic/beats#26518

  • this has some good research in it, hopefully helpful to the team!

@adam-stokes
Contributor

This looks to have been fixed. I will wait for @mdelapenya to verify, but our tests are passing again.

@EricDavisX
Contributor

The duplicate issue was re-tested and found fixed, so let's not wait; I'm closing this out. If the tests are still failing, it is for other reasons, and we should open new tickets. Thanks, Adam.

@mdelapenya
Contributor Author

Fixed in elastic/kibana#104415
