
User job writing too many logs will cause disk pressure #4694

Closed
Binyang2014 opened this issue Jul 10, 2020 · 8 comments · Fixed by #4702

@Binyang2014
Contributor

PAI keeps user job logs under /var/log/pai.
If a user job writes too many logs, it will cause disk pressure on the machine.

We need to:

  1. Make the log path configurable, so we can store user logs on a large disk
  2. Investigate how to kill such offending jobs
Binyang2014 self-assigned this Jul 10, 2020
@fanyangCS
Contributor

Related to #3765 and #3340

@fanyangCS
Contributor

@Binyang2014, please first make sure the OpenPAI service pods have a higher QoS class than job pods. In some cases the service pods get evicted.

@Binyang2014
Contributor Author

> @Binyang2014, please first make sure the OpenPAI service pods have a higher QoS class than job pods. In some cases the service pods get evicted.

We may need to mark these pods as critical to achieve this: https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
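For example, a service DaemonSet could reference one of the built-in critical priority classes. A minimal sketch (the DaemonSet name and image here are placeholders, and the built-in system-* classes may be restricted to the kube-system namespace depending on the cluster version):

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: log-manager-ds            # illustrative service name
  namespace: kube-system          # built-in critical classes may only be usable here
spec:
  selector:
    matchLabels:
      app: log-manager
  template:
    metadata:
      labels:
        app: log-manager
    spec:
      # Marks the pod as node-critical so it is scheduled before, and evicted after, ordinary pods
      priorityClassName: system-node-critical
      containers:
      - name: log-manager
        image: log-manager:latest   # hypothetical image
```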

@Binyang2014
Contributor Author

Checked the QoS classes. Currently, the job-exporter and node-exporter QoS classes are Burstable, the log-manager QoS class is BestEffort, and the user job pod QoS class is Guaranteed.
For k8s node eviction, BestEffort pods are evicted first. Guaranteed and Burstable pods whose usage is beneath requests are evicted last. (https://kubernetes.io/docs/tasks/administer-cluster/out-of-resource/?spm=a2c65.11461447.0.0.6a497eafo7oGQp#evicting-end-user-pods)

Among Guaranteed and Burstable pods whose resource usage does not exceed their requests, eviction is ordered by pod priority.

For this case, the resource is disk, and none of the pods claim requests for disk. So the eviction order is ranked by pod priority, then by resource usage.
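For reference, the QoS class is derived purely from the containers' resource requests and limits; a minimal sketch of the three cases (pod names and resource values are hypothetical):

```yaml
# Guaranteed: every container sets limits, with requests equal to limits (how the user job pods end up Guaranteed)
apiVersion: v1
kind: Pod
metadata:
  name: guaranteed-example
spec:
  containers:
  - name: app
    image: busybox
    resources:
      requests: {cpu: "1", memory: 1Gi}
      limits:   {cpu: "1", memory: 1Gi}
---
# Burstable: some requests/limits are set but they do not all match (job-exporter, node-exporter)
apiVersion: v1
kind: Pod
metadata:
  name: burstable-example
spec:
  containers:
  - name: app
    image: busybox
    resources:
      requests: {cpu: 100m, memory: 128Mi}
---
# BestEffort: no requests or limits at all (log-manager); evicted first under node pressure
apiVersion: v1
kind: Pod
metadata:
  name: besteffort-example
spec:
  containers:
  - name: app
    image: busybox
```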
We can get the eviction order from the log:

```
eviction manager: must evict pod(s) to reclaim ephemeral-storage
eviction_manager.go:362] eviction manager: pods ranked for eviction: user-job-pod, job-exporter-zgf4j_default(e9e0a4a5-660a-493b-ac4b-a95f8977867a),
nginx-proxy-prodk80bg000012_kube-system(6d20234d7a8eda76fb23d52d6f743b77),
log-manager-ds-9tflc_default(67956c0d-ba89-4aa9-a14e-ed2fbbcd915e),
k8s-host-device-plugin-daemonset-kqh67_kube-system(8853e257-8f09-46ca-b372-2632cb94eea5),
blobfuse-flexvol-installer-vrqdg_kube-system(dd5a903a-9a38-4f27-b3d8-00bed955c9e9),
node-exporter-2ks4s_default(326bb034-a011-4960-baef-6eb6fa7c9f24),
kube-proxy-l7gzb_kube-system(82351f89-30b4-4a20-b679-0a15f213b999),
nvidia-device-plugin-daemonset-j2f8z_kube-system(9d98eaed-cc87-43c4-a514-82c147833843)
```

Since the user job usually consumes more disk, it will be evicted first. But because we use hostPath for the log folder, evicting the user job does not reclaim the space (the logs written to the hostPath directory survive the eviction), so it does not solve the problem. Then kubelet continues to evict PAI service pods.

To leverage the k8s eviction policy to avoid disk pressure, we'd better not store job logs on each host.
It's better to use the default k8s log mechanism, then use fluentd to ship logs to a centralized storage server. (https://kubernetes.io/docs/concepts/cluster-administration/logging/#cluster-level-logging-architectures)
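A minimal sketch of the node-level logging-agent pattern from that doc (the agent name, image, and mounts are placeholders rather than OpenPAI's actual setup; the fluentd output config pointing at the storage server is assumed to live elsewhere):

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluentd-log-agent        # hypothetical name
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: fluentd-log-agent
  template:
    metadata:
      labels:
        app: fluentd-log-agent
    spec:
      containers:
      - name: fluentd
        image: fluentd:latest    # placeholder image
        # Reads the container logs that the kubelet/runtime already writes under /var/log
        # and forwards them to the centralized backend configured in the fluentd config.
        volumeMounts:
        - name: varlog
          mountPath: /var/log
          readOnly: true
      volumes:
      - name: varlog
        hostPath:
          path: /var/log
```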

@Binyang2014
Contributor Author

Also, our log-manager is misconfigured: it does not rotate the logs according to size, only according to time. After reconfiguring the log-manager and fixing some bugs, this issue can be mitigated.
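For illustration, a size-based rotation stanza could look like the following (the ConfigMap wrapper, file path, and numbers are hypothetical, not the actual log-manager configuration):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: log-manager-logrotate    # hypothetical name
data:
  logrotate.conf: |
    # Rotate PAI job logs by size rather than time, keeping a bounded number of archives
    /var/log/pai/*/*.log {
        size 100M
        rotate 5
        copytruncate
        compress
        missingok
    }
```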

@fanyangCS
Contributor

@Binyang2014
Can we update the deployment script to avoid future misconfiguration? (At least we should update the documentation.)

Moreover, it seems we need the following:

  1. Set the QoS class of the job pod to the "lowest" (BestEffort); set the QoS class of log-manager to Burstable (same as other OpenPAI services)
  2. To avoid evicting the wrong job pods, we still need a watchdog to kill the offending job pod
  3. Leverage the k8s log mechanism

Anything more?

@Binyang2014
Contributor Author

Since the QoS class is assigned by k8s according to the pod's resource requests/limits, we can't change the job pod QoS class (changing the request/limit values may affect the scheduler). We can keep the job pod QoS class as Guaranteed, and use pod priority to control the eviction rank.

We can do the following:

  1. Set the QoS class of log-manager to Burstable, and fix some logrotate config errors.
  2. Give PAI service pods higher priority than job pods, to make sure they will not be evicted before job pods (see the sketch after this list).
  3. Add a watchdog to kill the offending job pod, or leverage the k8s log mechanism. If we leverage the k8s log mechanism, k8s will help us kill the offending job.
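A minimal sketch of item 2 using a PriorityClass (the class name and value are hypothetical; job pods would simply keep the default priority or use a lower-valued class):

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: pai-service-priority     # hypothetical name
value: 1000000                   # higher than the default of 0 used by job pods
globalDefault: false
description: "Priority for OpenPAI service pods so they are evicted after job pods."

# Each OpenPAI service pod template would then reference it, e.g.:
#   spec:
#     priorityClassName: pai-service-priority
```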

@Binyang2014
Contributor Author

Closed, this issue is already fixed.
