This repository has been archived by the owner on Nov 16, 2023. It is now read-only.
Releases: microsoft/frameworkcontroller
v1.0.1
v1.0.0
v0.9.0
v0.8.0
v0.7.0
v0.6.0
v0.5.0
New Feature
- Support large scale Framework by LargeFrameworkCompression (#44)
- Add PodNodeName to help track failures on node before PodIP is available (#45)
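The release note does not describe how LargeFrameworkCompression works internally. As a rough illustration of the general idea (shrinking a large, repetitive status payload before it is stored), the following sketch gzips a JSON-encoded list of Task statuses and restores it. The `TaskStatus` type and the `compress` and `decompress` helpers are hypothetical and are not taken from FrameworkController.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"encoding/json"
	"fmt"
	"io"
)

// TaskStatus is a hypothetical, heavily simplified stand-in for a per-Task status.
type TaskStatus struct {
	Index int    `json:"index"`
	State string `json:"state"`
}

// compress serializes v to JSON and gzips the result.
func compress(v interface{}) ([]byte, error) {
	raw, err := json.Marshal(v)
	if err != nil {
		return nil, err
	}
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if _, err := zw.Write(raw); err != nil {
		return nil, err
	}
	// Close flushes the remaining compressed data into buf.
	if err := zw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// decompress reverses compress and decodes the JSON into out.
func decompress(data []byte, out interface{}) error {
	zr, err := gzip.NewReader(bytes.NewReader(data))
	if err != nil {
		return err
	}
	defer zr.Close()
	raw, err := io.ReadAll(zr)
	if err != nil {
		return err
	}
	return json.Unmarshal(raw, out)
}

func main() {
	// Assumption: a large Framework has many Tasks with near-identical status fields.
	statuses := make([]TaskStatus, 10000)
	for i := range statuses {
		statuses[i] = TaskStatus{Index: i, State: "TaskAttemptRunning"}
	}
	raw, err := json.Marshal(statuses)
	if err != nil {
		panic(err)
	}
	packed, err := compress(statuses)
	if err != nil {
		panic(err)
	}
	var restored []TaskStatus
	if err := decompress(packed, &restored); err != nil {
		panic(err)
	}
	fmt.Printf("raw=%d bytes, gzip=%d bytes, restored=%d tasks\n",
		len(raw), len(packed), len(restored))
}
```

Repetitive status data of this kind tends to compress well, which is why compression is a plausible way to keep a very large object within storage size limits.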
Full Commit History Since Previous Release
v0.4.0
New Feature
- Add example to leverage HivedScheduler to achieve GPU Topology-Aware, Multi-Tenant, Priority and Gang Scheduling (#34)
- Support to expose Framework and Pod history snapshots to external systems (#31)
- Support to classify and summarize Pod failures (#41)
- Support to tune Framework Consistency vs Availability (#43)
  This helps to avoid a Pod being stuck in deletion forever, for example if its Node is down permanently.
- Support Stop Framework (#24)
  This helps to stop the Framework without deleting it.
- Still sync Task after FrameworkAttempt Completing (#27)
  This helps to make sure all Tasks in the Framework are updated to the right completed status when the whole FrameworkAttempt is completed.
- Support FrameworkCompletedRetainSec (#37)
  This helps to automatically delete the Framework after it has been completed for a long time, to free ETCD space.
- Add FrameworkAttemptPreparing State (#12)
  This helps to distinguish whether at least one Task of the current attempt has ever entered the TaskAttemptRunning state. If not, the attempt is FrameworkAttemptPreparing rather than FrameworkAttemptRunning.
- Redefine FrameworkAttemptRunning and Record attempt running start time (#35)
  This helps to measure the pure running duration of a Framework and its Tasks.
- Support Pod Template Placeholders (#21)
Bug Fix
- Fix TaskCompleted may transition to TaskAttemptCompleted (#10)
- Fix fExpectedStatusInfos map race condition (#18)
Misc
- Upgrade to kubernetes-1.14.2 (#16)
- Remove Internal and External CompletionTypeAttribute (#22 #41)
  This is because FrameworkController does not need to be aware of it, so the choice is left to the controller wrapper.
- Upgrade to golang 1.12.6 (#29)
- Switch to klog (#30)
Full Commit History Since Previous Release
v0.3.0
v0.2.0
Add Distributed TensorFlow Training Example
Feature
- Support both GPU and CPU Distributed Training
- Automatically clean up PS when the whole FrameworkAttempt is completed
- No need to adjust existing TensorFlow image
- No need to set up Kubernetes DNS and Kubernetes Service
- Common Feature
Prerequisite
- Need to set up Kubernetes GPU, if you need GPU Training
- Need to set up Kubernetes Cluster-Level Logging, if you need to persist and expose the log for deleted Pods
Quick Start
Support FrameworkBarrier for GangExecution
Feature
It is usually used as the InitContainer to provide a simple way to:
- Do Gang Execution without resource deadlock
- Start the AppContainers in the Pod only after its PodUID is persisted by FrameworkController
- Inject peer-to-peer service discovery information into the AppContainers
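The barrier idea in the list above can be sketched as a check that every Task already has a persisted PodUID and a PodIP, followed by building the peer list that would be handed to the AppContainers. This is a minimal sketch, assuming a simplified `TaskInfo` record. The type, the function names, and the `PEERS_<ROLE>` variable naming are hypothetical and are not FrameworkBarrier's actual interface.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// TaskInfo is a hypothetical view of what a barrier needs to know about each Task.
type TaskInfo struct {
	TaskRole string
	Index    int
	PodUID   string // empty until the controller has persisted it
	PodIP    string // empty until the Pod has been assigned an IP
}

// barrierPassed reports whether every Task has both its PodUID and PodIP known.
// An empty Task list never passes.
func barrierPassed(tasks []TaskInfo) bool {
	if len(tasks) == 0 {
		return false
	}
	for _, t := range tasks {
		if t.PodUID == "" || t.PodIP == "" {
			return false
		}
	}
	return true
}

// peerEnv builds one variable per TaskRole that lists the IPs of its Tasks,
// ordered by Task index. The PEERS_ prefix is a hypothetical naming choice.
func peerEnv(tasks []TaskInfo) map[string]string {
	byRole := map[string][]TaskInfo{}
	for _, t := range tasks {
		byRole[t.TaskRole] = append(byRole[t.TaskRole], t)
	}
	env := map[string]string{}
	for role, ts := range byRole {
		sort.Slice(ts, func(i, j int) bool { return ts[i].Index < ts[j].Index })
		ips := make([]string, len(ts))
		for i, t := range ts {
			ips[i] = t.PodIP
		}
		env["PEERS_"+strings.ToUpper(role)] = strings.Join(ips, ",")
	}
	return env
}

func main() {
	tasks := []TaskInfo{
		{TaskRole: "worker", Index: 1, PodUID: "uid-w1", PodIP: "10.0.0.3"},
		{TaskRole: "ps", Index: 0, PodUID: "uid-p0", PodIP: "10.0.0.1"},
		{TaskRole: "worker", Index: 0, PodUID: "uid-w0", PodIP: "10.0.0.2"},
	}
	if barrierPassed(tasks) {
		for name, value := range peerEnv(tasks) {
			fmt.Printf("%s=%s\n", name, value)
		}
	}
}
```

In a real InitContainer the check would be repeated until it passes or a timeout expires, so the AppContainers start only after the whole gang is known.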