
Reserve evaluator host spec in TF_CONFIG cluster, only when in evaluator process #515

Merged 1 commit on Mar 8, 2021

Conversation

@zuston zuston commented Mar 8, 2021

What

Fix a bug where the evaluator throws an exception when using the estimator multi-worker strategy.

Why

In the previous PR, in order to be compatible with ParameterServerStrategy v2, we removed the evaluator from the TF_CONFIG cluster.
After testing the estimator with the TF 1.x PS strategy, it worked well.

However, I recently ran the estimator multi-worker strategy under TF 1.x and found that it reports an error. The error is as follows:

I0308 09:19:31.673173 140448615241536 simple_estimator.py:53] TF_CONFIG is {"worker":["node1:22739","node2:19226"],"evaluator":["node3:11265"]}
2021-03-08 09:19:31.674276: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2021-03-08 09:19:31.697051: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2600000000 Hz
2021-03-08 09:19:31.702105: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x3cf6610 executing computations on platform Host. Devices:
2021-03-08 09:19:31.702135: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): <undefined>, <undefined>
Traceback (most recent call last):
  File "simple_estimator.py", line 197, in <module>
    tf.app.run(main=main, argv=[sys.argv[0]])
  File "/usr/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/usr/lib/python2.7/site-packages/absl/app.py", line 303, in run
    _run_main(main, args)
  File "/usr/lib/python2.7/site-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "simple_estimator.py", line 106, in main
    train_distribute=tf.distribute.experimental.MultiWorkerMirroredStrategy(),
  File "/usr/lib/python2.7/site-packages/tensorflow/python/distribute/collective_all_reduce_strategy.py", line 95, in __init__
    communication=communication))
  File "/usr/lib/python2.7/site-packages/tensorflow/python/distribute/collective_all_reduce_strategy.py", line 110, in __init__
    self._initialize_strategy(cluster_resolver)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/distribute/collective_all_reduce_strategy.py", line 116, in _initialize_strategy
    self._initialize_multi_worker(cluster_resolver)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/distribute/collective_all_reduce_strategy.py", line 184, in _initialize_multi_worker
    self._num_workers = multi_worker_util.worker_count(cluster_spec, task_type)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/distribute/multi_worker_util.py", line 156, in worker_count
    _validate_cluster_spec(cluster_spec, task_type, task_id=0)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/distribute/multi_worker_util.py", line 74, in _validate_cluster_spec
    raise ValueError("`task_type` %r not found in cluster_spec." % task_type)
ValueError: `task_type` 'evaluator' not found in cluster_spec.

Therefore, I think we should keep the evaluator configuration in the cluster only when running in the evaluator process.

Why does this strange asymmetry exist? I believe it is a design quirk of the TF estimator, and I could not find any detailed documentation about it.
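To make the failure mode concrete, here is a simplified sketch (an assumption for illustration, not the actual TensorFlow source) of the cluster-spec check that produces the traceback above: MultiWorkerMirroredStrategy requires the current task type to appear as a key in the "cluster" dict, so an evaluator process whose own entry was stripped from the cluster fails validation.

```python
def validate_cluster_spec(cluster_spec, task_type):
    """Raise ValueError if task_type is missing from the cluster spec.

    Simplified stand-in for TF's multi_worker_util._validate_cluster_spec.
    """
    if task_type not in cluster_spec:
        raise ValueError("`task_type` %r not found in cluster_spec." % task_type)

# Pre-patch behavior: evaluator stripped from the cluster for every process,
# so validation fails in the evaluator process.
cluster = {"worker": ["node1:22739", "node2:19226"]}
try:
    validate_cluster_spec(cluster, "evaluator")
except ValueError as e:
    print(e)  # `task_type` 'evaluator' not found in cluster_spec.
```

A worker process passes the same check, because "worker" is present in the cluster; only the evaluator is affected.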

With this patch, TF_CONFIG will look like the following.

For Chief/Worker/PS

TF_CONFIG = {
  "cluster": {
    "ps": ["localhost:port1"],
    "worker": ["localhost:port2"],
    "chief": ["localhost:port3"]
  },
  "task": {
    "type": "chief/ps/worker",
    "index": 0
  }
}

For Evaluator

TF_CONFIG = {
  "cluster": {
    "ps": ["localhost:port1"],
    "worker": ["localhost:port2"],
    "chief": ["localhost:port3"],
    "evaluator": ["localhost:port4"]
  },
  "task": {
    "type": "evaluator",
    "index": 0
  }
}
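The two shapes above can be produced by a single builder that drops the evaluator host spec for every role except the evaluator itself. This is a hypothetical sketch of the patched behavior; the function name `build_tf_config` and the port values are illustrative, not TonY's actual API.

```python
import json
import os


def build_tf_config(cluster, task_type, task_index):
    """Build a TF_CONFIG dict, keeping the evaluator entry only for the
    evaluator process (illustrative sketch of this PR's behavior)."""
    cluster = dict(cluster)  # copy so the caller's dict is untouched
    if task_type != "evaluator":
        # Non-evaluator processes must not see the evaluator host spec,
        # to stay compatible with ParameterServerStrategy v2.
        cluster.pop("evaluator", None)
    return {"cluster": cluster, "task": {"type": task_type, "index": task_index}}


full_cluster = {
    "ps": ["localhost:2001"],
    "worker": ["localhost:2002"],
    "chief": ["localhost:2003"],
    "evaluator": ["localhost:2004"],
}

# Worker process: evaluator dropped from the cluster.
worker_cfg = build_tf_config(full_cluster, "worker", 0)
assert "evaluator" not in worker_cfg["cluster"]

# Evaluator process: evaluator kept, so cluster-spec validation succeeds.
eval_cfg = build_tf_config(full_cluster, "evaluator", 0)
assert "evaluator" in eval_cfg["cluster"]

os.environ["TF_CONFIG"] = json.dumps(eval_cfg)
```

With this split, PS-v2 compatibility is preserved for chief/worker/ps processes while the evaluator still finds its own entry in the cluster spec.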

@oliverhu oliverhu merged commit 06aef04 into tony-framework:master Mar 8, 2021
zuston added a commit to zuston/TonY that referenced this pull request Apr 21, 2021
zuston pushed a commit to zuston/TonY that referenced this pull request Apr 21, 2021
Backport: Reserve evaluator host spec in TF_CONFIG cluster, only when in evaluator process tony-framework#515

See merge request !31