[doc] configuring offload_* param sections #1005

Open
stas00 opened this issue Apr 24, 2021 · 3 comments

@stas00
Collaborator

stas00 commented Apr 24, 2021

#998 tackles the aio param section, but we still have no user guide for the new "offload_optimizer" and "offload_param" sections. We have:

            "offload_optimizer": {
                "device": "nvme",
                "nvme_path": "/local_nvme",
                "pin_memory": true,
                "buffer_count": 4,
                "fast_init": false
            },
            "offload_param": {
                "device": "nvme",
                "nvme_path": "/local_nvme",
                "pin_memory": true,
                "buffer_count": 5,
                "buffer_size": 1e8,
                "max_in_cpu": 1e9
            }

Other than device, nvme_path and pin_memory, which are pretty obvious, the rest have super-terse descriptions, and a user will have no idea how to configure them. Let's write a guide on how these values should be chosen.
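For reference, here is a minimal sketch of where these two sections would sit in a full config. The values are copied from the snippet above; the enclosing "zero_optimization" section with "stage": 3 is an assumption, since that wrapper is not shown in this issue.

```python
import json

# Values copied from the snippet above. Placing both sections under
# "zero_optimization" with "stage": 3 is an assumption; that wrapper
# is not shown in this issue.
ds_config = {
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "nvme",
            "nvme_path": "/local_nvme",
            "pin_memory": True,
            "buffer_count": 4,
            "fast_init": False,
        },
        "offload_param": {
            "device": "nvme",
            "nvme_path": "/local_nvme",
            "pin_memory": True,
            "buffer_count": 5,
            # Written as ints so the JSON contains 100000000, not 100000000.0.
            "buffer_size": int(1e8),
            "max_in_cpu": int(1e9),
        },
    }
}

# Serialize to the JSON text that would go into a config file.
config_text = json.dumps(ds_config, indent=4)
print(config_text)
```

Building the config as a Python dict makes it easier to vary buffer_count, buffer_size and max_in_cpu programmatically while experimenting.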

I copied the descriptions and defaults that already exist and tried to ask the right questions, so if you could answer those I think that would be a great start.

Thank you!

Optimizer

  • buffer_count: default 4: Number of buffers in buffer pool for optimizer state offloading to NVMe. This should be at least the number of states maintained per parameter by the optimizer. For example, the Adam optimizer has 4 states (parameter, gradient, momentum, and variance).

Q: why "at least" - is it more efficient to have it bigger?
Q: what's the impact on memory footprint (CPU/NVMe)?
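Until the questions above are answered, the "at least" rule from the description can be expressed as a sanity check. This is only a sketch: check_optimizer_buffer_count is a hypothetical helper, not a DeepSpeed function, and the count of 4 states for Adam is taken from the description above.

```python
# Per the description above: parameter, gradient, momentum, variance.
ADAM_STATES_PER_PARAM = 4


def check_optimizer_buffer_count(offload_optimizer, states_per_param=ADAM_STATES_PER_PARAM):
    """Hypothetical helper: reject a buffer_count below the optimizer's state count."""
    # 4 is the documented default for offload_optimizer.buffer_count.
    buffer_count = offload_optimizer.get("buffer_count", 4)
    if buffer_count < states_per_param:
        raise ValueError(
            f"buffer_count={buffer_count} is below the {states_per_param} "
            "states the optimizer keeps per parameter"
        )
    return buffer_count


print(check_optimizer_buffer_count({"device": "nvme", "buffer_count": 4}))
```

It only enforces the lower bound; whether a larger value is more efficient is exactly the open question.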

  • fast_init: default false. Enable fast optimizer initialization when offloading to NVMe.

Q: why is it false by default?

Param

  • buffer_count: default 5: Number of buffers in buffer pool for parameter offloading to NVMe.

Q: why 5, what are the correlations to other params?

  • buffer_size: default 1e8: Size of buffers in buffer pool for parameter offloading to NVMe.

Q: how do we get to this number, and how does it correlate with other config params?
Q: what's the impact on memory footprint (CPU/NVMe)?

  • max_in_cpu: default 1e9: Number of parameter elements to maintain in CPU memory when offloading to NVMe is enabled.

Q: how do we get to this number, and how does it correlate with other config params?
Q: what's the impact on memory footprint (CPU/NVMe)?
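As a starting point for the memory-footprint questions, here is a back-of-the-envelope sketch for the param section. It rests on assumptions that the descriptions above do not confirm: that buffer_size counts elements (like max_in_cpu does) rather than bytes, that the buffer pool lives in CPU memory, and that alignment padding and any other allocations can be ignored. estimate_param_offload_cpu_bytes is a hypothetical helper, not a DeepSpeed function.

```python
# Bytes per element for common training dtypes.
BYTES_PER_ELEMENT = {"fp16": 2, "bf16": 2, "fp32": 4}


def estimate_param_offload_cpu_bytes(offload_param, dtype="fp16"):
    """Hypothetical helper: rough CPU memory implied by an offload_param section."""
    element_bytes = BYTES_PER_ELEMENT[dtype]
    # Fall back to the documented defaults: 5, 1e8 and 1e9.
    buffer_count = int(offload_param.get("buffer_count", 5))
    buffer_size = int(offload_param.get("buffer_size", 1e8))
    max_in_cpu = int(offload_param.get("max_in_cpu", 1e9))

    # Assumption: buffer_size is a number of elements, not bytes.
    buffer_pool_bytes = buffer_count * buffer_size * element_bytes
    # max_in_cpu is documented as a number of parameter elements.
    max_in_cpu_bytes = max_in_cpu * element_bytes
    return {
        "buffer_pool_bytes": buffer_pool_bytes,
        "max_in_cpu_bytes": max_in_cpu_bytes,
        "total_bytes": buffer_pool_bytes + max_in_cpu_bytes,
    }


estimate = estimate_param_offload_cpu_bytes(
    {"buffer_count": 5, "buffer_size": 1e8, "max_in_cpu": 1e9}
)
for name, value in estimate.items():
    print(f"{name}: {value / 1e9:.1f} GB")
```

Under these assumptions the defaults come to about 1 GB for the buffer pool plus 2 GB for max_in_cpu in fp16. The figure is for a single process and says nothing about the optimizer section.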

@fahadh4ilyas

Is there any guidance on how to set those params? I keep getting OOM, but I don't understand the parameters in the config. The documentation also didn't help much.

@koesnow

koesnow commented May 25, 2023

I am also looking for guidelines for setting those params. They do seem to have a meaningful impact on performance on my server setup, but setting the values too high kills the whole system.

@xuanyaoming

> I am also looking for guidelines for setting those params. They do seem to have a meaningful impact on performance on my server setup, but setting the values too high kills the whole system.

Same issue here. Even reading the paper doesn't help at all. Is there any documentation yet that explains what these params do?
