Infer and cache date field format instead of re-parsing it for every document #4558

Closed
itiyama opened this issue Sep 20, 2022 · 5 comments · Fixed by #9567 or #10385
Labels: enhancement, untriaged, v2.12.0, v3.0.0

Comments

@itiyama

itiyama commented Sep 20, 2022

The date field with the default format uses high CPU during parsing. A large portion of date formatting time (close to 7.12% of CPU time in profiles) goes into parsing, which generally happens when the date format is optional for certain segments. Our customers rarely set the date format explicitly and instead rely on the unoptimized default one. When I changed the date format to a strict one for the same data set, indexing throughput increased by 8%.

For logs, the date format does not change across different log lines. Hence, it is pretty inefficient to compute the date format for every single document. For such users, we could infer and set a stricter date format after parsing a few documents.

Additionally, 7% CPU seems too high just for date parsing. Maybe the Java formatter has improved since I ran these tests. The CPU profile shows that most of the time goes into parsing the optional segments of the date.

Solutions?

  1. We should definitely improve our documentation to clearly call out that the date mapping should be set to a stricter format if the format is known in advance.
  2. Infer the date time parsing format when it is set to optional and reuse it across requests?
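
For the first option, a minimal sketch of what a stricter mapping could look like, written as a Python dict that serializes to an index-creation body. The field name `@timestamp` is a hypothetical example, not something from this issue; the two format names are the ones benchmarked later in this thread.

```python
import json

# Hypothetical field name, used for illustration only.
FIELD_NAME = "@timestamp"


def date_mapping(fmt):
    """Build an index-creation body whose date field uses the given format."""
    return {
        "mappings": {
            "properties": {
                FIELD_NAME: {"type": "date", "format": fmt},
            }
        }
    }


# Default-style format: optional segments, so each value may need several attempts.
default_body = date_mapping("strict_date_optional_time||epoch_millis")

# Stricter format: a single fixed shape, suitable when every log line looks the same.
strict_body = date_mapping("strict_date_time_no_millis")

print(json.dumps(strict_body, indent=2))
```

This only helps when the shape of the timestamps is known up front, which is why the second option (inferring the format) is still worth discussing.
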
itiyama added the enhancement and untriaged labels on Sep 20, 2022
tlfeng removed the untriaged label on Sep 20, 2022
@dblock
Member

dblock commented Sep 20, 2022

I wonder whether we can leverage the fact that most of the time all documents have the same date format. Maybe the date parser code can cache the date format and attempt to reuse it, only falling back to re-computing the date format when that fails?
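
A minimal sketch of that idea, assuming a parser that holds an ordered list of candidate formats. OpenSearch itself is Java; this Python version uses `datetime.strptime` patterns as stand-ins for the real formatters, and the class name and `fallbacks` counter are hypothetical.

```python
from datetime import datetime


class CachingDateParser:
    """Try the last successful format first; re-scan all formats only on failure."""

    def __init__(self, formats):
        self.formats = list(formats)  # configured order
        self.cached = None            # format that matched most recently
        self.fallbacks = 0            # how often the full list had to be scanned

    def parse(self, text):
        if self.cached is not None:
            try:
                return datetime.strptime(text, self.cached)
            except ValueError:
                self.cached = None  # cached format no longer fits; recompute
        self.fallbacks += 1
        for fmt in self.formats:
            try:
                value = datetime.strptime(text, fmt)
            except ValueError:
                continue
            self.cached = fmt
            return value
        raise ValueError(f"no configured format matches {text!r}")


parser = CachingDateParser(["%Y-%m-%d", "%Y-%m-%dT%H:%M:%SZ"])
parser.parse("2022-04-05T22:00:12Z")  # full scan, caches the second format
parser.parse("2022-04-06T01:02:03Z")  # served by the cached format
print(parser.cached, parser.fallbacks)
```

One consequence of this design is that a cached format is tried before formats that appear earlier in the configured list, so the configured order is no longer strictly honored when more than one format could match.
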

@itiyama
Author

itiyama commented Sep 20, 2022

Yes, we should cache it.

@CaptainDredge
Contributor

I've started working on this issue and will share baseline benchmarking numbers soon to highlight the differences in CPU usage across datetime field formats.

@CaptainDredge
Contributor

Microbenchmarks

| Experiment (datetime format) | Data | Average time (ns/op) | Std dev |
| --- | --- | --- | --- |
| epoch_millis | 123456789 | 915.87 | 2.613 |
| strict_date_optional_time\|\|epoch_millis | 123456789 | 1141.29 | 68.7 |
| strict_date_optional_time | "2022-04-05T22:00:12Z" | 3019.366 | 40.168 |
| yyyy-MM-ddThh:mm:ssZ | "2022-04-05T22:00:12Z" | 669.83 | 13.83 |
| strict_date_time_no_millis | "2022-04-05T22:00:12Z" | 1155.38 | 30.03 |

CPU % Diff Matrix:

| Baseline \ Candidate | strict_date_time_no_millis | strict_date_optional_time | yyyy-MM-ddThh:mm:ssZ |
| --- | --- | --- | --- |
| strict_date_time_no_millis | NA | -161 | 42 |
| strict_date_optional_time | 61 | NA | 71 |
| yyyy-MM-ddThh:mm:ssZ | -72 | -350 | NA |
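
For readers who want to relate the matrix to the microbenchmark averages, here is a small sketch. The formula, (baseline - candidate) / baseline * 100 truncated to an integer, is an assumption on my part; it reproduces five of the six posted cells. The strict_date_optional_time to yyyy-MM-ddThh:mm:ssZ cell computes to about 77 rather than the posted 71, so that entry may come from a different run or formula.

```python
# Average times (ns/op) copied from the microbenchmark table above.
avg_ns = {
    "strict_date_time_no_millis": 1155.38,
    "strict_date_optional_time": 3019.366,
    "yyyy-MM-ddThh:mm:ssZ": 669.83,
}


def pct_diff(baseline, candidate):
    """Percent of baseline time saved by the candidate (negative means slower).

    Assumed formula; the thread does not state how the matrix was computed.
    """
    return (avg_ns[baseline] - avg_ns[candidate]) / avg_ns[baseline] * 100


for baseline in avg_ns:
    cells = []
    for candidate in avg_ns:
        if candidate == baseline:
            cells.append("NA")
        else:
            cells.append(str(int(pct_diff(baseline, candidate))))
    print(baseline, cells)
```
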

Grab workload benchmark (Grab is synthetically generated data; osb-benchmark didn't have a workload for testing different datetime mappings except http_logs, which has a datetime type fallback and where I didn't notice any significant improvement)

We found no significant difference in search performance between the two formats.

@CaptainDredge
Contributor

CaptainDredge commented Jul 27, 2023

With respect to the implementation for this issue, we have a couple of approaches for caching the datetime field format:

  • Cache it at the node level. If the last X datetime parses on the node succeeded consecutively with a specific datetime format, we cache that format and, for each subsequent document, try the cached format first, provided the field mapping contains it. This is a micro-optimization on a per-node basis. It may lead to not honoring the order in which the customer defined the datetime formatters. Also, any cached state is reset on a node restart.
  • Cache it at the shard level. On each data node we maintain a map from shard to cached datetime format. The caching criterion is similar: if the last X datetime parsing requests on a particular shard succeed with a specific datetime format, that format is cached for the shard and tried first for subsequent document parsing requests on that shard. This may also lead to OpenSearch not honoring the order of the datetime formatters provided by the user, and any cached state is reset on a node restart. It may also lead to different parsing flows on different shards of the same index, so it may also address shard-level hotspots that receive different types of datetime fields.
  • Cache the datetime format at the index level. After each successful datetime parse, we update the datetime field's format list in the index metadata with a reordered version that puts the last used formatter first, so that it is tried first for subsequent indexing requests. This may also lead to us not honoring the order of formatters provided by the user in the datetime field mapping. The cache is not reset on node restart and is uniform across shards for that datetime field.
  • Where the user has provided multiple optional datetime formats but a stricter datetime format suffices for the last X datetime field parses, we may choose to override the user-provided format with a stricter, more optimized version and cache it at the index level for further use. This will lead to not honoring the list of datetime formats provided by the user, so customer expectations need to be aligned on this.
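
The "last X consecutive successes" criterion shared by the node-level and shard-level options can be sketched as follows. This is an illustrative Python model, not OpenSearch code: the class name, the `scope` key, and the `strptime` patterns are all hypothetical stand-ins. A single constant scope models the node-level cache, and an (index, shard) tuple models the shard-level one.

```python
from collections import defaultdict
from datetime import datetime


class ThresholdFormatCache:
    """Promote a format for a scope after `threshold` consecutive successes."""

    def __init__(self, formats, threshold=3):
        self.formats = list(formats)  # user-defined order
        self.threshold = threshold    # the "X" in the proposal
        self.streak = defaultdict(lambda: (None, 0))  # scope -> (format, run length)
        self.promoted = {}            # scope -> format tried first

    def candidates(self, scope):
        """Formats to try for this scope, with the promoted one (if any) first."""
        first = self.promoted.get(scope)
        if first is None:
            return list(self.formats)
        return [first] + [f for f in self.formats if f != first]

    def parse(self, scope, text):
        for fmt in self.candidates(scope):
            try:
                value = datetime.strptime(text, fmt)
            except ValueError:
                continue
            self._record(scope, fmt)
            return value
        raise ValueError(f"no configured format matches {text!r}")

    def _record(self, scope, fmt):
        last, run = self.streak[scope]
        run = run + 1 if fmt == last else 1
        self.streak[scope] = (fmt, run)
        if run >= self.threshold:
            self.promoted[scope] = fmt


cache = ThresholdFormatCache(["%Y-%m-%d", "%Y-%m-%dT%H:%M:%SZ"], threshold=2)
shard = ("logs", 0)  # hypothetical (index, shard id) key
cache.parse(shard, "2022-04-05T22:00:12Z")
cache.parse(shard, "2022-04-05T22:00:13Z")
print(cache.candidates(shard))  # promoted format now comes first for this shard
```

Because the state lives in process memory, it is lost on restart, which matches the caveat noted for the node-level and shard-level options. The index-level option would instead persist the reordered list in index metadata.
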

Once we implement caching, a quick win will be to add a stricter format like strict_no_millis to the default formatter as one of its formatters, so that the overhead of strict_date_optional_time is minimized when the datetime fields conform to the strict format at runtime.

@Prabs @tharejas @mgodwan please provide your thoughts on this
