[Remote Store] RFC - Adding segment download metrics to remotestore stats API #8395

Closed
shourya035 opened this issue Jul 3, 2023 · 0 comments · Fixed by #8718
Labels
enhancement Enhancement or improvement to existing feature or request RFC Issues requesting major changes Storage:Durability Issues and PRs related to the durability framework Storage Issues and PRs relating to data and metadata storage

Comments

@shourya035
Member

Overview

As of today, the _remotestore/stats API only shows segment upload stats. We are planning to add segment download stats to this API as well. This would give end users vital stats on segment downloads and help them troubleshoot slow index recovery times.

The _remotestore/stats API currently only takes the primary shard copies of an index into account. This is because the primary shard copy performs all segment uploads to the remote store. Replica shard copies, on the other hand, only download segments from the remote store as and when required, so surfacing download stats requires the API to report replica copies as well.


Metrics to be added:

We are proposing the following metrics for tracking segment downloads from the remote store:

  • last_download_timestamp : Timestamp, in milliseconds, of the last successful download from the remote store.
  • total_file_downloads : Would be available in the started, succeeded and failed statistics. Tracks the total number of files downloaded from the remote store.
  • total_file_downloads_in_bytes : Would be available in the started, succeeded and failed statistics. Tracks the total size of files downloaded from the remote store.
  • download_size_in_bytes : Would be available in the moving_avg and last_successful statistics. Tracks the size of the last successful segment download and the average size of payloads downloaded from the remote store.
  • download_speed_in_bytes_per_sec : Would be available in the moving_avg statistic. Tracks the average speed of downloads from the remote store.
  • download_latency_in_millis : Would be available in the moving_avg statistic. Tracks the time taken for downloads from the remote store to complete.
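To make the started/succeeded/failed counters and the moving_avg / last_successful statistics concrete, here is a minimal Python sketch of a per-shard tracker. It is not the OpenSearch implementation: the class and method names, the fixed-window simple moving average, and the window size of 20 are all assumptions for illustration.

```python
import time
from collections import deque
from dataclasses import dataclass
from typing import Deque, Optional


@dataclass
class Counts:
    """started/succeeded/failed counters, mirroring the layout in the proposal."""
    started: int = 0
    succeeded: int = 0
    failed: int = 0


class MovingAverage:
    """Simple moving average over the most recent `window` samples (assumed scheme)."""

    def __init__(self, window: int) -> None:
        self._samples: Deque[float] = deque(maxlen=window)  # old samples fall off

    def record(self, value: float) -> None:
        self._samples.append(value)

    @property
    def value(self) -> float:
        # An empty window reports 0, like the zeroed fields in the sample output.
        if not self._samples:
            return 0.0
        return sum(self._samples) / len(self._samples)


class SegmentDownloadStats:
    """Hypothetical per-shard tracker for the proposed segment download metrics."""

    def __init__(self, window: int = 20) -> None:
        self.last_download_timestamp = 0      # epoch millis of the last success
        self.file_counts = Counts()           # number of files
        self.file_bytes = Counts()            # size of those files in bytes
        self.last_successful_size_in_bytes = 0
        self.size_in_bytes = MovingAverage(window)
        self.speed_in_bytes_per_sec = MovingAverage(window)
        self.latency_in_millis = MovingAverage(window)

    def on_started(self, size_in_bytes: int) -> None:
        self.file_counts.started += 1
        self.file_bytes.started += size_in_bytes

    def on_failed(self, size_in_bytes: int) -> None:
        self.file_counts.failed += 1
        self.file_bytes.failed += size_in_bytes

    def on_succeeded(self, size_in_bytes: int, latency_in_millis: int,
                     now_millis: Optional[int] = None) -> None:
        self.file_counts.succeeded += 1
        self.file_bytes.succeeded += size_in_bytes
        self.last_download_timestamp = (
            now_millis if now_millis is not None else int(time.time() * 1000)
        )
        self.last_successful_size_in_bytes = size_in_bytes
        self.size_in_bytes.record(size_in_bytes)
        self.latency_in_millis.record(latency_in_millis)
        # Speed is only defined for a non-zero latency; skip the sample otherwise.
        if latency_in_millis > 0:
            self.speed_in_bytes_per_sec.record(size_in_bytes * 1000 / latency_in_millis)
```

Here every file contributes one sample to each moving average. The RFC does not say whether the real metrics are sampled per file or per download round, so treat that granularity as an assumption.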

To accommodate these metrics, we are also proposing a change to the _remotestore/stats API output. The new output would look like the following sample:

Path:

GET /_remotestore/stats/<index>

Response:

{
    "indices": {
        "index-name": {
            "shards": {
                "0": [{
                        "routing": {
                            "state": "STARTED",
                            "primary": true,
                            "node": "a7gVvK0hRo69f_Zo--niDw"
                        },
                        "segment": {
                            "download": {},
                            "upload": {
                                "last_upload_timestamp": "123456789012",
                                "total_file_uploads": {
                                    "started": 10,
                                    "succeeded": 10,
                                    "failed": 0
                                },
                                "total_file_uploads_in_bytes": {
                                    "started": 10,
                                    "succeeded": 10,
                                    "failed": 0
                                },
                                "upload_size_in_bytes": {
                                    "moving_avg": 10
                                },
                                "upload_speed_in_bytes_per_sec": {
                                    "moving_avg": 10
                                },
                                "upload_latency_in_millis": {
                                    "moving_avg": 10
                                }
                            }
                        },
                        "translog": {
                            "download": {},
                            "upload": {}
                        }
                    },
                    {
                        "routing": {
                            "state": "STARTED",
                            "primary": false,
                            "node": "xybzjkhsdakkl--as"
                        },
                        "segment": {
                            "download": {
                                "last_download_timestamp": "123456789012",
                                "total_file_downloads": {
                                    "started": 0,
                                    "succeeded": 0,
                                    "failed": 0
                                },
                                "total_file_downloads_in_bytes": {
                                    "started": 0,
                                    "succeeded": 0,
                                    "failed": 0
                                },
                                "download_size_in_bytes": {
                                    "last_successful": 0,
                                    "moving_avg": 0
                                },
                                "download_speed_in_bytes_per_sec": {
                                    "moving_avg": 0
                                },
                                "download_latency_in_millis": {
                                    "moving_avg": 0
                                }
                            },
                            "upload": {}
                        },
                        "translog": {
                            "download": {},
                            "upload": {}
                        }
                    }
                ],
                "1":[{
                    ...
                }],
                "2": [{
                    ...
                }],
                ...
            }
        }
    }
}
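A client consuming this layout walks indices → shards → shard copies and reads each copy's segment.download section. The following Python sketch flags copies with failed segment downloads. The sample above uses `...` placeholders, so it is not literal JSON; the sketch instead parses a trimmed, hypothetical response, and the function names are illustrative only.

```python
import json
from typing import Any, Dict, Iterator, List, Tuple


def iter_shard_copies(response: Dict[str, Any]) -> Iterator[Tuple[str, str, Dict[str, Any]]]:
    """Yield (index_name, shard_id, shard_copy) for every copy in a stats response."""
    for index_name, index_stats in response.get("indices", {}).items():
        for shard_id, copies in index_stats.get("shards", {}).items():
            for shard_copy in copies:
                yield index_name, shard_id, shard_copy


def download_failures(response: Dict[str, Any]) -> List[Tuple[str, str, str, int, float]]:
    """List (index, shard, node, failed_files, avg_latency_ms) for copies with failures."""
    report = []
    for index_name, shard_id, shard_copy in iter_shard_copies(response):
        download = shard_copy.get("segment", {}).get("download", {})
        if not download:
            continue  # this copy reports no segment download stats
        failed = download.get("total_file_downloads", {}).get("failed", 0)
        latency = download.get("download_latency_in_millis", {}).get("moving_avg", 0)
        if failed > 0:
            node = shard_copy.get("routing", {}).get("node", "?")
            report.append((index_name, shard_id, node, failed, latency))
    return report


# Trimmed, hypothetical response following the proposed layout.
SAMPLE = json.loads("""
{"indices": {"index-name": {"shards": {"0": [
  {"routing": {"state": "STARTED", "primary": true, "node": "node-a"},
   "segment": {"download": {},
               "upload": {"total_file_uploads": {"started": 10, "succeeded": 10, "failed": 0}}}},
  {"routing": {"state": "STARTED", "primary": false, "node": "node-b"},
   "segment": {"download": {"total_file_downloads": {"started": 5, "succeeded": 4, "failed": 1},
                            "download_latency_in_millis": {"moving_avg": 120}},
               "upload": {}}}
]}}}}
""")

print(download_failures(SAMPLE))  # → [('index-name', '0', 'node-b', 1, 120)]
```

Keying off whether segment.download is non-empty, rather than off the primary flag, keeps the sketch independent of which copies end up reporting download stats.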

The shard-level path follows the same format:

Path:

GET /_remotestore/stats/<index>/<shardId>

Response:

{
    "indices": {
        "index-name": {
            "shards": {
                "0": [{
                        "routing": {
                            "state": "STARTED",
                            "primary": true,
                            "node": "a7gVvK0hRo69f_Zo--niDw"
                        },
                        "segment": {
                            "download": {},
                            "upload": {
                                "last_upload_timestamp": "123456789012",
                                "total_file_uploads": {
                                    "started": 10,
                                    "succeeded": 10,
                                    "failed": 0
                                },
                                "total_file_uploads_in_bytes": {
                                    "started": 10,
                                    "succeeded": 10,
                                    "failed": 0
                                },
                                "upload_size_in_bytes": {
                                    "moving_avg": 10
                                },
                                "upload_speed_in_bytes_per_sec": {
                                    "moving_avg": 10
                                },
                                "upload_latency_in_millis": {
                                    "moving_avg": 10
                                }
                            }
                        },
                        "translog": {
                            "download": {},
                            "upload": {}
                        }
                    },
                    {
                        "routing": {
                            "state": "STARTED",
                            "primary": false,
                            "node": "xybzjkhsdakkl--as"
                        },
                        "segment": {
                            "download": {
                                "last_download_timestamp": "123456789012",
                                "total_file_downloads": {
                                    "started": 0,
                                    "succeeded": 0,
                                    "failed": 0
                                },
                                "total_file_downloads_in_bytes": {
                                    "started": 0,
                                    "succeeded": 0,
                                    "failed": 0
                                },
                                "download_size_in_bytes": {
                                    "last_successful": 0,
                                    "moving_avg": 0
                                },
                                "download_speed_in_bytes_per_sec": {
                                    "moving_avg": 0
                                },
                                "download_latency_in_millis": {
                                    "moving_avg": 0
                                }
                            },
                            "upload": {}
                        },
                        "translog": {
                            "download": {},
                            "upload": {}
                        }
                    }
                ]
            }
        }
    }
}
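Both endpoints differ only in the optional shard ID suffix. The Python sketch below builds either path and fetches it with the standard library. The function names, the base URL (for example `http://localhost:9200`), and the lack of authentication or TLS handling are assumptions for illustration, not part of the proposal.

```python
import json
import urllib.request
from typing import Any, Dict, Optional
from urllib.parse import quote


def remotestore_stats_path(index: str, shard_id: Optional[int] = None) -> str:
    """Build GET /_remotestore/stats/<index> or /_remotestore/stats/<index>/<shardId>."""
    path = "/_remotestore/stats/" + quote(index, safe="")
    if shard_id is not None:  # shard 0 is valid, so compare against None
        path += "/" + str(shard_id)
    return path


def fetch_remotestore_stats(base_url: str, index: str,
                            shard_id: Optional[int] = None) -> Dict[str, Any]:
    """GET the stats and decode the JSON body (hypothetical helper, no auth/TLS)."""
    url = base_url.rstrip("/") + remotestore_stats_path(index, shard_id)
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)
```

For example, `remotestore_stats_path("index-name", 0)` yields `/_remotestore/stats/index-name/0`, whose response would contain only the copies of shard 0.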

This format change would also accommodate the new translog upload stats proposed in #8311.

@shourya035 shourya035 added enhancement Enhancement or improvement to existing feature or request untriaged labels Jul 3, 2023
@Rishikesh1159 Rishikesh1159 added RFC Issues requesting major changes Storage:Durability Issues and PRs related to the durability framework labels Jul 5, 2023
@Bukhtawar Bukhtawar added the Storage Issues and PRs relating to data and metadata storage label Jul 27, 2023
4 participants