[8.7] Description for host.name updated (FQDN issue) #2122

Merged Jan 25, 2023 (5 commits)

Conversation

mitodrummer
Contributor

@mitodrummer mitodrummer commented Dec 13, 2022

Updates the description for the host.name field to encourage the use of lowercase FQDN as a value for host.name.

Taken from the linked issue where the proposal for the change was submitted:
All Elastic ECS producers should populate the host.name field with the lowercased FQDN from here forward.
The ECS definition for host.name should be updated to recommend the use of lowercase FQDN. e.g., "Name of the host. It can contain what hostname returns on Unix systems, the fully qualified domain name (FQDN), or a name specified by the user. The recommended value is the lowercase FQDN of the host."
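As a rough illustration of the recommended value, here is a minimal sketch of how a producer might derive a lowercase FQDN using Python's standard `socket` module. The function name `recommended_host_name` and its injectable `fqdn_lookup` parameter are hypothetical and are not taken from any Elastic producer.

```python
import socket


def recommended_host_name(fqdn_lookup=socket.getfqdn):
    """Return the lowercased FQDN, falling back to the short hostname.

    `fqdn_lookup` is injectable only so the behaviour can be exercised
    without depending on the local resolver.
    """
    fqdn = fqdn_lookup()
    # getfqdn() may return just the short hostname when no FQDN resolves,
    # so the result is a best effort rather than a guaranteed FQDN.
    return (fqdn or socket.gethostname()).lower()


# Example with a stubbed lookup instead of the real resolver.
example = recommended_host_name(lambda: "Web01.Corp.Example.COM")
```

With the stubbed lookup, `example` is `web01.corp.example.com`; on a real host the result depends on how name resolution is configured.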

@ebeahan please advise if you'd like this change to go through the RFC process.

The description used was taken from this discussion: elastic/beats#1070 (comment)

  • Have you signed the contributor license agreement? y
  • Have you followed the contributor guidelines? y
  • For proposing substantial changes or additions to the schema, have you reviewed the RFC process? y
  • If submitting code/script changes, have you verified all tests pass locally using make test? y
  • If submitting schema/fields updates, have you generated new artifacts by running make and committed those changes? y
  • Is your pull request against main? Unless there is a good reason otherwise, we prefer pull requests against main and will backport as needed. y
  • Have you added an entry to the CHANGELOG.next.md? y

@joshdover

Is there a way to go ahead and document that we'd like this recommendation to be required in ECS 9.0 as a breaking change? Or where are we tracking future breaking changes to ECS?

@ebeahan
Member

ebeahan commented Dec 15, 2022

@joshdover @mitodrummer have you checked in with anyone from observability to make sure they're aware of this host.name recommendation change?

maybe @ruflin?

@ruflin
Member

ruflin commented Dec 16, 2022

Can we have a more detailed description in the PR of what exactly this change means? Is it a breaking change? Linking to a discussion is useful for additional details, but let's make sure all the necessary details are in the PR itself.

@ruflin
Member

ruflin commented Dec 16, 2022

In general I like the idea of moving to FQDN as the recommendation for host.name (instead of introducing a new field). Has anyone thought about what the migration path might look like? Let's assume that today host.name is not shipped as FQDN and then, in the next minor or major, it is. Assuming host.name is a dimension for TSDB, suddenly these are all new time series. Queries run for a specific host name stop working. Terms aggregations over longer time periods where both data sets exist show twice as many hosts as there are.

As part of this change, we should come up with a recommended migration path instead of just putting the change out there and hoping everyone will figure it out.
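The effect on queries and aggregations can be sketched with plain Python over a few made-up events. Everything here is illustrative: the event data is invented, and `terms_on_host_name` and `filter_by_host_name` only mimic a terms aggregation and an exact-match query on host.name; they are not Elasticsearch APIs.

```python
# Hypothetical events from ONE physical host, before and after the switch
# from short hostname to FQDN.
events = [
    {"@timestamp": "2023-01-10T00:00:00Z", "host": {"name": "web01"}},
    {"@timestamp": "2023-01-20T00:00:00Z", "host": {"name": "web01"}},
    {"@timestamp": "2023-02-10T00:00:00Z", "host": {"name": "web01.example.com"}},
]


def terms_on_host_name(docs):
    """Mimic a terms aggregation on host.name: distinct value -> doc count."""
    buckets = {}
    for doc in docs:
        name = doc["host"]["name"]
        buckets[name] = buckets.get(name, 0) + 1
    return buckets


def filter_by_host_name(docs, name):
    """Mimic an exact-match query on host.name."""
    return [d for d in docs if d["host"]["name"] == name]


buckets = terms_on_host_name(events)          # two buckets for one host
old_query = filter_by_host_name(events, "web01")  # misses the newest event
```

One host produces two buckets, and a saved query for the old name silently stops matching data shipped after the switch.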

@joshdover

> Let's assume that today host.name is not shipped as FQDN and then, in the next minor or major, it is. Assuming host.name is a dimension for TSDB, suddenly these are all new time series. Queries run for a specific host name stop working. Terms aggregations over longer time periods where both data sets exist show twice as many hosts as there are.

It's been discussed to some extent in the referenced discussion, but I agree we should make this clear here. The issue today is that this data is likely already incorrect and confusing for users who have multiple hosts with the same hostname on different domains. In such a case you could have two or more different hosts reporting on the same time series, which basically makes the reported data meaningless.
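A small sketch of that collision, using invented sample data: two distinct hosts share the short name `db01`, so keying a series on the short name merges their readings, while keying on the FQDN keeps them apart. The helper `group_cpu` is hypothetical.

```python
from collections import defaultdict

# Hypothetical CPU samples from two DIFFERENT hosts with the same short name.
samples = [
    {"fqdn": "db01.prod.example.com", "cpu_pct": 90},
    {"fqdn": "db01.test.example.com", "cpu_pct": 5},
]


def group_cpu(rows, key_fn):
    """Group CPU readings into series keyed by whatever key_fn returns."""
    series = defaultdict(list)
    for row in rows:
        series[key_fn(row)].append(row["cpu_pct"])
    return dict(series)


# Keyed on the short hostname: both hosts land in one series.
by_short = group_cpu(samples, lambda r: r["fqdn"].split(".", 1)[0])
# Keyed on the FQDN: one series per host.
by_fqdn = group_cpu(samples, lambda r: r["fqdn"])
```

Here `by_short` holds a single `db01` series mixing readings from both machines, whereas `by_fqdn` holds two separate series.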

That said, the new time series problem is a real one and changing this will disrupt a user's ability to analyze data from a specific host over a time period that includes the switch. In terms of how we could mitigate that, I see a few options:

  1. Allow users to make a choice about this change by opting in or out
  2. Don't change host.name and instead add a new host.fqdn or similar field
  3. Document or include a workaround using a runtime field that concatenates host.hostname and host.domain for old data that doesn't use fqdn for host.name
    • Docs could explain how to create this runtime field in Kibana and then how to clone integration dashboards to use it
    • I'm not 100% sure that concatenating will always produce the right result for a fqdn?
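To make option 3 concrete, here is a minimal sketch of the concatenation logic, written in Python purely for illustration (an actual Elasticsearch runtime field would be scripted in Painless). The function name `derived_fqdn` is hypothetical, and the rule "a host.name containing a dot is already an FQDN" is an assumption. As noted above, concatenating host.hostname and host.domain may not always yield a true FQDN, for example when host.domain holds a NetBIOS or Active Directory domain name.

```python
def derived_fqdn(doc):
    """Best-effort FQDN for a document, mimicking the proposed runtime field.

    Returns a lowercased name, or None when the document has no host name.
    """
    host = doc.get("host", {})
    name = host.get("name", "")
    # Assumption: a dotted host.name was already shipped as an FQDN.
    if "." in name:
        return name.lower()
    hostname = host.get("hostname") or name
    domain = host.get("domain")
    # Old-style data: rebuild the name from hostname + domain when possible.
    if hostname and domain:
        return f"{hostname}.{domain}".lower()
    # No domain available: fall back to whatever short name exists.
    return name.lower() or None


old_doc = {"host": {"name": "web01", "hostname": "web01", "domain": "Example.com"}}
new_doc = {"host": {"name": "web01.example.com"}}
same_host = derived_fqdn(old_doc) == derived_fqdn(new_doc)
```

In this example old and new documents resolve to the same value, which is what would let a cloned dashboard group data across the switch.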

My understanding is that most O11y data usually has fairly short data retention periods (up to 90 days), so this will be a temporary problem. Security data is more likely to be retained for very long periods of time, and my understanding from the SIEM input here is that this is already a problem that customers would prefer to see solved sooner rather than later.

In practice, we also haven't started using TSDB time series mappings in our integrations yet, and this change would likely ship in the same version in which we start doing so (8.7). This may be lucky timing that helps mitigate the concern.

Overall, I prefer we move forward with this proposal but include something like (3) for users that need to do further lookbacks.

@ruflin
Member

ruflin commented Dec 16, 2022

> My understanding is that most O11y data usually has fairly short data retention periods (up to 90 days)

I don't think this applies to logs, and there are also quite a few metrics use cases out there that have much longer retention.

Comparing 1 and 3 above: 3 requires centralised logic to do the change. It works well in the context where Fleet is in place but for self managed setups, this will be tricky. I keep coming back to 1 as the most promising option. My ideal scenario:

  • Existing installations do not change even if Elastic Agent is upgraded
  • New installations directly have FQDN
  • Existing installations change on a major upgrade

Maybe we can go with "Existing installations change on a minor upgrade if the config flag is not set". It means a potential breaking change for a user, but the user at least could opt out.

@norrietaylor
Member

@joshdover and @ruflin
Is it safe to assume from the conversation here that the two of you agree with the migration plan as stated above, i.e., this will only take effect on new installations until a major version change?

What else is needed for approval on this change to the ECS description?

@ruflin
Member

ruflin commented Dec 20, 2022

@norrietaylor Seems like we agree. But it all feels a bit theoretical and it would be great if someone could play around with this change and see what effects it has.

@joshdover

joshdover commented Dec 20, 2022

I think implementing detection of new vs. existing installs would be non-trivial, and I wonder if we could get away with making it completely opt-in from the Fleet UI until the next major version. I imagine this could be an advanced setting on the Agent Policy or global Fleet Settings tab.

@nimarezainia

@joshdover @ruflin sorry to open this can again. I just want to get clarity on how we would achieve what is being suggested above for all the different products and installation methods we have:

(A) Recognizing a New Install

  • For a Fleet-managed Agent we can tell it's a new installation, since upgrades happen via Fleet
  • For RPM/DEB installs we are not in charge of upgrading; that happens through the package tool. I am not certain whether an upgrade in that context is a re-install or not
  • For Beats and standalone Agent, I believe that an upgrade is essentially a reinstall of the binary
    For Beats and standalone Agent, how will we differentiate between an installation and an upgrade?

(B) Toggle switch or feature flag

  • similarly applicable to Fleet managed agents only
    How would this work for Beats and Standalone?

I want to avoid the complexity that the logic above may introduce, but equally I really don't want to break existing customers. That's why we originally recommended a new ECS field. As Josh said, if the user is relying on host.name, queries are broken for them already. So I am wondering whether we bite the bullet now, avoid the complexity, introduce a breaking change, and communicate the hell out of it to the user base, OR alternatively wait until 9.0 to introduce this.

@MikePaquette would love to get your opinion on this. From the original discussions the preference seemed to be to use the same field.

@joshdover

> (A) Recognizing a New Install

I don't think automatically determining what the user wants will be that simple, since they probably want the same behavior for all agents, including ones that were enrolled or configured recently. They may even expect the same behavior if they create a new cluster (eg. a test and prod). If our heuristic gets any of this wrong, we will have a non-obvious breaking change.

> (B) Toggle switch or feature flag
>
>   • similarly applicable to Fleet managed agents only
>     How would this work for Beats and Standalone?

We would need to have a new config flag somewhere in *beat.yml and elastic-agent.yml. For Beats, probably the general settings would make sense. For agent, probably somewhere under agent.*

@ferullo

ferullo commented Jan 4, 2023

Endpoint would need to be passed this config value as well

@joshdover

joshdover commented Jan 10, 2023

@nimarezainia what do you think about the proposal above? We need to align on this as some development work has already begun. To summarize again, the proposal would be:

  • Make this change opt-in for 8.x, in 9.0 change the default to FQDN, but still allow opt-out for BWC
  • For Beats and standalone Agent, this is configured via a new yaml setting
  • Agent would pass this configuration down to each sub-process via the control protocol
  • For Fleet-managed Agent, a new global and agent policy toggle would be available in the UI and via kibana.yml preconfiguration
  • Presumably, this field would be added by the Agent shipper in the future, instead of managed by each subprocess
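A minimal sketch of how such an opt-in setting could select the value, under stated assumptions: the setting key `use_fqdn`, the function `resolve_host_name`, and the `default_use_fqdn` parameter are all hypothetical names invented for illustration, not the names used by Beats, Elastic Agent, or Fleet.

```python
def resolve_host_name(settings, hostname, fqdn, default_use_fqdn=False):
    """Pick the host.name value based on an opt-in setting.

    settings         -- dict parsed from the producer's YAML config (hypothetical)
    hostname         -- short hostname, i.e. the current behaviour
    fqdn             -- fully qualified name, or None if it could not be resolved
    default_use_fqdn -- False models the 8.x default; True models a 9.0 default
    """
    use_fqdn = settings.get("use_fqdn", default_use_fqdn)
    # Only switch when the user opted in AND an FQDN is actually available.
    if use_fqdn and fqdn:
        return fqdn.lower()
    return hostname


# 8.x style: nothing changes unless the user opts in.
unchanged = resolve_host_name({}, "WEB01", "WEB01.example.com")
opted_in = resolve_host_name({"use_fqdn": True}, "WEB01", "WEB01.example.com")
# 9.0 style: FQDN by default, with an explicit opt-out for BWC.
opted_out = resolve_host_name({"use_fqdn": False}, "WEB01", "WEB01.example.com",
                              default_use_fqdn=True)
```

Keeping the default as a parameter makes the difference between the 8.x and 9.0 behaviour a one-line change while leaving the explicit user setting authoritative in both.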

If we have alignment on this, we need to schedule more tasks across the various projects to enable this.

@epixa
Contributor

epixa commented Jan 19, 2023

I don't think we should reuse host.name for this at all. We're not talking about fixing a bug or improving documentation, we are removing a feature entirely - the ability for a sender to arbitrarily name hosts. We don't have concrete data on who is using this, but anyone that is using this will have no official alternative to a feature we've documented in ECS since at least 1.0.

I don't have all of the context, but from the comments in this PR it seems clear to me that what we want from this field is something different entirely. We want a universal string representation of the originating host of data, which is not what host.name is. In fact, that's essentially what the entire host fieldset is supposed to be, but if we want to represent it as a single string, it should just be a new field because it is a new piece of data.

@joshdover

joshdover commented Jan 23, 2023

> We're not talking about fixing a bug or improving documentation, we are removing a feature entirely - the ability for a sender to arbitrarily name hosts. We don't have concrete data on who is using this, but anyone that is using this will have no official alternative to a feature we've documented in ECS since at least 1.0.

Practically, I think introducing a new field for what I believe is the intended usage of this field (from a query/read perspective - split/count/agg data by unique host) is going to require a lot of work across integrations, alerting, Fleet, and user-written queries. I don't think this work would even have an end, as I expect the habit of reaching for host.name is so commonplace that we'd probably end up using it in new integrations as well before fixing it.

If allowing users to populate their own custom name is an intended use case as well, then I'd propose we:

  • Don't add a new field for FQDN - no sense in storing more data for customers who don't need this
  • Update the ECS guidance to prefer FQDN, but don't make a breaking change in 9.0 or at any point
  • Make it easier for users to customize what is sent in host.name from Elastic Agent (and its underlying processes). This should be settable via YAML configuration for standalone Agent and via the Fleet UI for managed Agent. For now we could support only the current behavior (hostname) and add FQDN. In the future we could allow users to customize it to a static or dynamic value based on variables from providers.

This is essentially the same proposal as before, but we won't plan to make any breaking changes in the future. The only use case this proposal wouldn't cover is users who want a customizable host.name and a separate field for FQDN. If this is also a desired use case, we could later consider adding FQDN as a separate field that is always populated, however I have not seen this request. I have only seen requests that the host.name field which is already used everywhere in the product can be set to the FQDN.

@nimarezainia

I agree with you @joshdover but would like to hear from @epixa
Customizing the hostname was never a requirement for this feature. Here we are just enabling the user to choose FQDN over hostname and will avoid any breaking change.

@epixa
Contributor

epixa commented Jan 23, 2023

@joshdover I have no objection to what you just proposed

@joshdover

I think we are all aligned on the implementation proposal now and it still matches the content of this PR itself. @elastic/ecs we still need approval on this PR to be able to merge it into the spec.

@ebeahan changed the title from "[8.6] Description for host.name updated (FQDN issue)" to "[8.7] Description for host.name updated (FQDN issue)" on Jan 25, 2023
@nicpenning
Contributor

> Let's assume that today host.name is not shipped as FQDN and then, in the next minor or major, it is. Assuming host.name is a dimension for TSDB, suddenly these are all new time series. Queries run for a specific host name stop working. Terms aggregations over longer time periods where both data sets exist show twice as many hosts as there are.

> It's been discussed to some extent in the referenced discussion, but agree we should make this clear here. The issue today is that this data is likely already incorrect and confusing for users if there are multiple hosts with the same hostname, but on different domains. In such a case you could have 2 or more different hosts reporting on the same time series, which will basically make the reported data meaningless.

FYI - I posted my 2 cents here: elastic/kibana#150239 (comment)

The damage has been done by making host.name not FQDN by default with the Elastic Agent 8.7.0 change above. (Speaking from a Windows environment.)
