[DOCS] Clarify field data cache behavior #64375
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -33,12 +33,14 @@ GET my-index-000001/_search | |
|
||
<1> Querying on the `_id` field (also see the <<query-dsl-ids-query,`ids` query>>) | ||
|
||
The value of the `_id` field is also accessible in aggregations or for sorting, | ||
but doing so is discouraged as it requires to load a lot of data in memory. In | ||
case sorting or aggregating on the `_id` field is required, it is advised to | ||
duplicate the content of the `_id` field in another field that has `doc_values` | ||
enabled. | ||
|
||
The `_id` field is by default not available for use with aggregations or sorting. | ||
To aggregate or sort by the `_id` field, it is recommended to | ||
duplicate the `_id` field onto a `keyword` field using the <<copy-to, `copy_to` mapping parameter>>. | ||
Review comment: Really small comment, the link text is usually just the parameter name: <<copy-to, `copy_to`>>
Review comment: I just realized that it's not possible to use
Review comment: Okay I'll clarify that, that wasn't clear in the original text. Going to move this entire section to the top. |
||
|
||
It is not recommended to enable `_id` fields to be aggregated using the <<modules-fielddata, in-memory field data cache>>, | ||
Review comment: Since we soon plan to entirely remove the ability to sort/aggregate on It looks like we forgot to mention |
||
but it is possible. This can be done by <<cluster-update-settings, changing the cluster setting>> | ||
to `"indices.id_field_data.enabled": true`. Enabling this setting and then aggregating on the `_id` | ||
field will use significant memory and show deprecation warnings in the logs. | ||
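As a rough sketch of the setting change described above, the following Python snippet only builds the request for the cluster update settings API; it does not send anything. The helper name `id_field_data_request` is hypothetical, and the choice of a `persistent` (rather than `transient`) setting is an assumption made for illustration.

```python
import json


def id_field_data_request(enabled):
    """Build (method, path, body) for toggling field data on the _id field.

    Assumption: the setting is applied as a persistent cluster setting.
    """
    body = {"persistent": {"indices.id_field_data.enabled": enabled}}
    return "PUT", "/_cluster/settings", json.dumps(body)


method, path, payload = id_field_data_request(True)
print(method, path)
print(payload)
```

Sending this request is still discouraged for the reasons given in the text: aggregating on `_id` afterwards uses significant memory and logs deprecation warnings.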
|
||
[NOTE] | ||
================================================== | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -34,11 +34,12 @@ to be enabled. | |
* Operations on parent and child documents from a `join` field, including | ||
`has_child` queries and `parent` aggregations. | ||
|
||
NOTE: The global ordinal mapping is an on-heap data structure. When measuring | ||
memory usage, Elasticsearch counts the memory from global ordinals as | ||
'fielddata'. Global ordinals memory is included in the | ||
<<fielddata-circuit-breaker, fielddata circuit breaker>>, and is returned | ||
under `fielddata` in the <<cluster-nodes-stats, node stats>> response. | ||
NOTE: The global ordinal mapping uses heap memory as part of the
jtibshirani marked this conversation as resolved.
|
||
<<modules-fielddata, field data cache>>. Aggregations that include high | ||
cardinality values can use a significant amount of heap memory, and | ||
could exceed the threshold of the | ||
<<fielddata-circuit-breaker, field data circuit breaker>>. | ||
It is recommended to set a specific limit for the field data cache size. | ||
Review comment: We are actually still discussing this recommendation in #59829, perhaps we could hold off on adding this sentence until we have a conclusion. Also maybe "Aggregations that include high cardinality values" -> "Aggregations on high cardinality fields"? |
||
|
||
==== Loading global ordinals | ||
|
||
|
This file was deleted.
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -120,11 +120,12 @@ PUT my-index-000001/_doc/4?routing=1&refresh | |
<2> `answer` is the name of the join for this document | ||
<3> The parent id of this child document | ||
|
||
==== Parent-join and performance. | ||
==== Parent-join and performance | ||
|
||
The join field shouldn't be used like joins in a relational database. In Elasticsearch, the key to good performance | ||
is to de-normalize your data into documents. Each join field, `has_child` or `has_parent` query adds a | ||
significant tax to your query performance. | ||
significant tax to your query performance. It also increases the usage of the JVM heap on the | ||
Review comment: I'm not actually sure it's a significant contributor to heap usage, since only one |
||
<<modules-fielddata, field data cache>>. | ||
|
||
The only case where the join field makes sense is if your data contains a one-to-many relationship where | ||
one entity significantly outnumbers the other entity. An example of such case is a use case with products | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -141,3 +141,112 @@ The following parameters are accepted by `text` fields: | |
<<mapping-field-meta,`meta`>>:: | ||
|
||
Metadata about the field. | ||
|
||
[[fielddata]] | ||
==== `fielddata` | ||
Review comment: I like this consolidation, it makes it clear this description only applies to |
||
|
||
`text` fields are searchable by default, but are not available by default for | ||
aggregations, sorting, or scripting. If you try to sort, aggregate, or access | ||
values from a script on a `text` field, you will see this exception: | ||
|
||
[literal] | ||
Fielddata is disabled on text fields by default. Set `fielddata=true` on | ||
[`your_field_name`] in order to load fielddata in memory by uninverting the | ||
inverted index. Note that this can however use significant memory. | ||
|
||
Field data is the only way to access the analyzed tokens from a full text field | ||
in aggregations, sorting, or scripting. For example, a full text value like `New York` | ||
is analyzed into the tokens `new` and `york`. Aggregating on these tokens requires field data. | ||
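To make the `New York` example concrete, here is a minimal Python sketch contrasting buckets built from analyzed tokens with buckets built from whole values. The `analyze` function is a hypothetical stand-in that lowercases and splits on whitespace; it only approximates what an analyzer does for simple input and is not Elasticsearch's analysis chain. The sample documents are invented for illustration.

```python
from collections import Counter


def analyze(text):
    # Assumption: lowercase + whitespace split, a simplification of real analysis.
    return text.lower().split()


docs = ["New York", "New Jersey", "York"]  # made-up field values

# Buckets over whole, unanalyzed values (what a keyword field would give).
keyword_buckets = Counter(docs)

# Buckets over analyzed tokens, counting each document once per token
# (what aggregating on a text field with field data would give).
token_buckets = Counter(tok for doc in docs for tok in set(analyze(doc)))

print(keyword_buckets)
print(token_buckets)
```

The token buckets no longer contain `New York` as a single entry, which is usually not what an aggregation on that field is meant to report.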
|
||
[[before-enabling-fielddata]] | ||
==== Before enabling fielddata | ||
|
||
It usually doesn't make sense to enable fielddata on text fields. Field data | ||
is stored on the heap in the <<modules-fielddata, field data cache>> because it | ||
is expensive to calculate. Calculating the field data can cause latency spikes, and | ||
the increased heap usage can cause cluster performance issues. | ||
|
||
Most users who want to do more with text fields use <<multi-fields, multi-field mappings>> | ||
by having both a `text` field for full text searches, and an | ||
unanalyzed <<keyword,`keyword`>> field for aggregations, as follows: | ||
|
||
[source,console] | ||
--------------------------------- | ||
PUT my-index-000001 | ||
{ | ||
"mappings": { | ||
"properties": { | ||
"my_field": { <1> | ||
"type": "text", | ||
"fields": { | ||
"keyword": { <2> | ||
"type": "keyword" | ||
} | ||
} | ||
} | ||
} | ||
} | ||
} | ||
--------------------------------- | ||
|
||
<1> Use the `my_field` field for searches. | ||
<2> Use the `my_field.keyword` field for aggregations, sorting, or in scripts. | ||
|
||
[[enable-fielddata-text-fields]] | ||
==== Enabling fielddata on `text` fields | ||
|
||
You can enable fielddata on an existing `text` field using the | ||
<<indices-put-mapping,PUT mapping API>> as follows: | ||
|
||
[source,console] | ||
----------------------------------- | ||
PUT my-index-000001/_mapping | ||
{ | ||
"properties": { | ||
"my_field": { <1> | ||
"type": "text", | ||
"fielddata": true | ||
} | ||
} | ||
} | ||
----------------------------------- | ||
// TEST[continued] | ||
|
||
<1> The mapping that you specify for `my_field` should consist of the existing | ||
mapping for that field, plus the `fielddata` parameter. | ||
|
||
[[field-data-filtering]] | ||
==== `fielddata_frequency_filter` | ||
|
||
Fielddata filtering can be used to reduce the number of terms loaded into | ||
memory, and thus reduce memory usage. Terms can be filtered by _frequency_: | ||
|
||
The frequency filter allows you to only load terms whose document frequency falls | ||
between a `min` and `max` value, which can be expressed as an absolute | ||
number (when the number is bigger than 1.0) or as a percentage | ||
(for example, `0.01` is `1%` and `1.0` is `100%`). Frequency is calculated | ||
*per segment*. Percentages are based on the number of docs which have a | ||
value for the field, as opposed to all docs in the segment. | ||
|
||
Small segments can be excluded completely by specifying the minimum | ||
number of docs that the segment should contain with `min_segment_size`: | ||
|
||
[source,console] | ||
-------------------------------------------------- | ||
PUT my-index-000001 | ||
{ | ||
"mappings": { | ||
"properties": { | ||
"tag": { | ||
"type": "text", | ||
"fielddata": true, | ||
"fielddata_frequency_filter": { | ||
"min": 0.001, | ||
"max": 0.1, | ||
"min_segment_size": 500 | ||
} | ||
} | ||
} | ||
} | ||
} | ||
-------------------------------------------------- |
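The frequency rules above can be sketched in Python for a single segment. This is an illustration of the rules as described in the text, not Elasticsearch's implementation. The function names, the sample term frequencies, and the rounding of percentage thresholds down to whole documents are assumptions. The sketch also assumes that a segment smaller than `min_segment_size` is excluded from filtering, so all of its terms are loaded.

```python
def resolve_threshold(value, docs_with_field):
    """Values above 1.0 are absolute counts; otherwise a fraction of docs with a value."""
    if value > 1.0:
        return int(value)
    return int(docs_with_field * value)  # assumption: round down


def terms_to_load(doc_freqs, docs_with_field, segment_doc_count,
                  min_value, max_value, min_segment_size=0):
    """Return the terms of one segment that pass the frequency filter."""
    if segment_doc_count < min_segment_size:
        # Assumption: small segments skip filtering and load every term.
        return set(doc_freqs)
    low = resolve_threshold(min_value, docs_with_field)
    high = resolve_threshold(max_value, docs_with_field)
    return {term for term, freq in doc_freqs.items() if low <= freq <= high}


# Hypothetical segment: 1000 docs, all with a value for the field.
doc_freqs = {"rare": 1, "common": 40, "stopword": 900}
kept = terms_to_load(doc_freqs, docs_with_field=1000, segment_doc_count=1000,
                     min_value=0.001, max_value=0.1, min_segment_size=500)
print(sorted(kept))
```

With the settings from the mapping above, the term that appears in 900 of 1000 documents is dropped, while a 200-document segment would fall under `min_segment_size` and keep all of its terms.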
Review comment: A few small comments to make the language more precise: