Infer and cache date field format instead of re-parsing it for every document #4558

itiyama · 2022-09-20T03:27:22Z

The date field with the default format uses a lot of CPU during parsing. A large portion of date formatting time (close to 7.12% of CPU time in profiles) goes into parsing, which generally happens when certain segments of the date format are optional. Our customers don’t often set the date parser, but rely on the unoptimized default one. When I changed the date parsing format to a strict one for the same data set, the indexing throughput increased by 8%.

For logs, the date format does not change across different log lines. Hence, it is pretty inefficient to compute the date format for every single document. For such users, we could infer and set a stricter date format after parsing a few documents.

Additionally, 7% CPU seems too high just for date parsing. Maybe the Java formatter has improved since I ran these tests. The CPU profile shows that most of the time goes into parsing the optional segments of the date.
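To make "optional segments" concrete, here is a minimal `java.time` sketch. It is an illustration only: `OPTIONAL` and `STRICT` are hypothetical formatters written for this example, not the actual definitions of `strict_date_optional_time` or any other built-in OpenSearch format.

```java
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeFormatterBuilder;
import java.time.temporal.TemporalAccessor;

final class OptionalVsStrict {
    // Lenient (hypothetical): the time and the offset are optional segments,
    // so the parser has to probe for each of them on every input.
    static final DateTimeFormatter OPTIONAL = new DateTimeFormatterBuilder()
        .append(DateTimeFormatter.ISO_LOCAL_DATE)
        .optionalStart()
        .appendLiteral('T')
        .append(DateTimeFormatter.ISO_LOCAL_TIME)
        .optionalStart()
        .appendOffsetId()
        .optionalEnd()
        .optionalEnd()
        .toFormatter();

    // Strict (hypothetical): every segment is required, nothing to probe.
    static final DateTimeFormatter STRICT =
        DateTimeFormatter.ofPattern("uuuu-MM-dd'T'HH:mm:ssXXX");

    // Returns the richest type the input supports: OffsetDateTime,
    // then LocalDateTime, then LocalDate.
    static TemporalAccessor parseLenient(String text) {
        return OPTIONAL.parseBest(text, OffsetDateTime::from, LocalDateTime::from, LocalDate::from);
    }

    // Throws DateTimeParseException unless the full date, time and offset are present.
    static OffsetDateTime parseStrict(String text) {
        return OffsetDateTime.parse(text, STRICT);
    }

    public static void main(String[] args) {
        System.out.println(parseLenient("2022-04-05T22:00:12Z").getClass().getSimpleName());
        System.out.println(parseLenient("2022-04-05").getClass().getSimpleName());
        System.out.println(parseStrict("2022-04-05T22:00:12Z").toInstant());
    }
}
```

The lenient formatter accepts both a bare date and a full timestamp, while the strict one only accepts the full timestamp. The sketch says nothing about relative cost; the profile described above is the evidence for that.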

Solutions?

We should definitely improve our documentation to clearly call out that the date mapping should be set to a stricter format if known well in advance.
Infer the datetime parsing format when it is set to optional and re-use it across requests?

dblock · 2022-09-20T20:44:25Z

I wonder whether we can leverage the fact that most of the time all documents have the same date format. Maybe the date parser code can cache the date format and attempt to reuse it, only falling back to re-computing the date format when that fails?
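A minimal sketch of this idea, assuming plain `java.time` formatters rather than OpenSearch's own date formatter classes. `CachingDateParser` and its methods are hypothetical names invented for this illustration.

```java
import java.time.DateTimeException;
import java.time.Instant;
import java.time.format.DateTimeFormatter;
import java.util.List;

final class CachingDateParser {
    private final List<DateTimeFormatter> formatters; // formats in configured order
    private volatile DateTimeFormatter cached;        // last formatter that succeeded, or null

    CachingDateParser(List<DateTimeFormatter> formatters) {
        this.formatters = List.copyOf(formatters);
    }

    Instant parse(String text) {
        DateTimeFormatter hint = cached;
        if (hint != null) {
            try {
                // Fast path: reuse the format that worked last time.
                return hint.parse(text, Instant::from);
            } catch (DateTimeException e) {
                // The cached format no longer matches; fall back to the full list.
            }
        }
        for (DateTimeFormatter f : formatters) {
            if (f == hint) {
                continue; // already tried above
            }
            try {
                Instant result = f.parse(text, Instant::from);
                cached = f; // remember the winner for the next document
                return result;
            } catch (DateTimeException e) {
                // Try the next configured format.
            }
        }
        throw new IllegalArgumentException("no configured format matches: " + text);
    }

    DateTimeFormatter cachedFormatter() {
        return cached;
    }
}
```

Note that the fast path tries the cached format before the configured order, so an input that matches more than one format may be parsed by a different formatter than the first configured match.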

itiyama · 2022-09-20T21:47:22Z

Yes, we should cache it.

CaptainDredge · 2023-07-03T06:15:20Z

Started working on this issue. I will share baseline benchmarking numbers soon to highlight the differences in CPU usage for different datetime field formats.

CaptainDredge · 2023-07-26T21:47:59Z

Microbenchmarks

Experiment (datetime format)	Data	Average Time (ns/op)	Std Dev
epoch_millis	123456789	915.87	2.613
strict_date_optional_time||epoch_millis	123456789	1141.29	68.7
strict_date_optional_time	"2022-04-05T22:00:12Z"	3019.366	40.168
yyyy-MM-ddThh:mm:ssZ	"2022-04-05T22:00:12Z"	669.83	13.83
strict_date_time_no_millis	"2022-04-05T22:00:12Z"	1155.38	30.03

CPU % Diff Matrix:

Baseline	strict_date_time_no_millis	strict_date_optional_time	yyyy-MM-ddThh:mm:ssZ
strict_date_time_no_millis	NA	$${\color{red}-161}$$	$${\color{green}42}$$
strict_date_optional_time	$${\color{green}61}$$	NA	$${\color{green}71}$$
yyyy-MM-ddThh:mm:ssZ	$${\color{red}-72}$$	$${\color{red}-350}$$	NA

Grab workload benchmark (Grab is synthetically generated data. osb-benchmark didn't have a workload for testing different datetime mappings except http_logs, which has a datetime type fallback and where I didn't notice any significant improvement.)

We found no significant difference in search performance between the formats.

CaptainDredge · 2023-07-27T10:51:36Z

With respect to the implementation for this issue, we have a couple of approaches for caching the datetime field format:

Cache it at the node level. If the last X datetime parses on the node succeeded consecutively with a specific datetime format, we cache that format and try it first for each subsequent document, provided the field mapping contains the cached format. This is a micro-optimization on a per-node basis. It may lead to not honoring the order in which the customer defined the datetime formatters. Also, the cache is reset on a node restart.
Cache it at the shard level. On each data node we maintain a mapping of shard to cached datetime format. The caching criterion is similar: if the last X datetime parsing requests on a particular shard succeed with a specific datetime format, that format is cached for the shard and tried first for subsequent document parsing requests on that shard. This may also lead to OpenSearch not honoring the order of the datetime formatters provided by the user. Also, the cache is reset on a node restart. Different shards of the same index may follow different parsing flows, so this may also address shard-level hotspots that receive different types of datetime fields.
Cache the datetime format at the index level. After each successful datetime parse, we update the datetime field mapping list in the index metadata with a reordered version that has the last used date formatter as the first element, so that it is tried first for subsequent indexing requests. This may also lead to us not honoring the order of formatters provided by the user in the datetime field mapping. The cache is not reset on a node restart and is uniform across shards for that datetime field.
In the case where the user has provided multiple optional datetime formats but a stricter datetime format suffices for the last X datetime field parses, we may choose to override the user-provided datetime format with a stricter, more optimized version and cache it at the index level for further use. This will lead to not honoring the list of datetime formats provided by the user, so customer expectations need to be aligned on this.
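The "last X parses succeeded" criterion shared by the first two approaches can be sketched as follows. This is a hypothetical illustration, not code from OpenSearch: `PromotingFormatOrder`, `recordSuccess` and `tryOrder` are invented names, formats are represented by their names only, and the instance could be held per node or per shard depending on the approach chosen.

```java
import java.util.ArrayList;
import java.util.List;

final class PromotingFormatOrder {
    private final List<String> configured; // format names in the user-defined order
    private final int threshold;           // the "X" consecutive successes required
    private String promoted;               // format currently tried first, or null
    private String streakFormat;           // format of the current success streak
    private int streak;                    // length of the current success streak

    PromotingFormatOrder(List<String> configured, int threshold) {
        this.configured = List.copyOf(configured);
        this.threshold = threshold;
    }

    // Call after a document's date was parsed successfully with the given format.
    synchronized void recordSuccess(String format) {
        if (!configured.contains(format)) {
            return; // only formats present in the field mapping may be cached
        }
        if (format.equals(streakFormat)) {
            streak++;
        } else {
            streakFormat = format;
            streak = 1;
        }
        if (streak >= threshold) {
            promoted = format;
        }
    }

    // Order in which formats should be tried for the next document.
    synchronized List<String> tryOrder() {
        if (promoted == null) {
            return configured;
        }
        List<String> order = new ArrayList<>(configured.size());
        order.add(promoted);
        for (String f : configured) {
            if (!f.equals(promoted)) {
                order.add(f);
            }
        }
        return order;
    }
}
```

Because this state lives only in memory, it is lost on restart, which matches the trade-off described for the node-level and shard-level options. The promoted format is tried before the user-defined order, which is the ordering concern raised above.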

Once we implement caching, a quick win will be to add a stricter format like strict_no_millis to the default formatter as one of the formatters, so that the overhead of strict_date_optional_time is minimized if the datetime fields conform to the strict format at runtime.

@Prabs @tharejas @mgodwan please provide your thoughts on this

itiyama added the enhancement and untriaged labels Sep 20, 2022

tlfeng removed the untriaged label Sep 20, 2022

shwetathareja assigned CaptainDredge Jul 3, 2023

CaptainDredge mentioned this issue Aug 28, 2023

Added performance improvement for datetime field parsing #9567

Merged

6 tasks

dblock closed this as completed in #9567 Oct 5, 2023

This was referenced Oct 5, 2023

Race condition fix for datetime optimization #10385

Merged

[Backport 2.x] Performance Improvement for Datetime formats to 2.x #10448

Merged

[Backport 2.11] Performance Improvement for Datetime formats to 2.11 #10453

Closed

reta reopened this Oct 13, 2023

reta added the v3.0.0 and v2.12.0 labels Oct 13, 2023

github-actions bot added the untriaged label Oct 13, 2023

reta closed this as completed Oct 19, 2023

reta mentioned this issue Nov 13, 2023

[BUG] Probably performance slowdown for indexing due to date string parsing #11177

Closed

BrewTestBot mentioned this issue Feb 21, 2024

opensearch 2.12.0 Homebrew/homebrew-core#163463

Merged
