

remove horovod upgrade to fix tf perf issue #302

Merged: 1 commit merged on May 11, 2022.


@zehuanw (Contributor) commented May 11, 2022

Remove horovod upgrade to solve #301.

The container builds without errors, and we have verified SOK DLRM performance with it.

@zehuanw linked an issue on May 11, 2022 that may be closed by this pull request.

@benfred (Member) left a comment


Thanks for putting this fix in @zehuanw !

Should we also remove this line from the docker/training/dockerfile.ctr file?

@jperez999 I see that this Horovod install was added by Alberto in #178. Do you remember why it was there?

@jperez999 (Collaborator)

No, I do not recall why that was there, but we can remove it. Horovod is already provided by the DLFW container.

@nvidia-merlin-bot (Contributor)

CI results:
GitHub pull request #302 of commit 4bd5210188978d37ea22e1a92794cd532a8c4413, no merge conflicts.
Running as SYSTEM
Setting status of 4bd5210188978d37ea22e1a92794cd532a8c4413 to PENDING with url https://10.20.13.93:8080/job/merlin_merlin/80/console and message: 'Pending'
Using context: Jenkins
Building on master in workspace /var/jenkins_home/workspace/merlin_merlin
using credential systems-login
 > git rev-parse --is-inside-work-tree # timeout=10
Fetching changes from the remote Git repository
 > git config remote.origin.url https://github.com/NVIDIA-Merlin/Merlin # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/Merlin
 > git --version # timeout=10
using GIT_ASKPASS to set credentials login for merlin-systems
 > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/Merlin +refs/pull/302/*:refs/remotes/origin/pr/302/* # timeout=10
 > git rev-parse 4bd5210188978d37ea22e1a92794cd532a8c4413^{commit} # timeout=10
Checking out Revision 4bd5210188978d37ea22e1a92794cd532a8c4413 (detached)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f 4bd5210188978d37ea22e1a92794cd532a8c4413 # timeout=10
Commit message: "remove horovod upgrade to fix tf perf issue"
 > git rev-list --no-walk a2237e607c24d2027d7be5eefc1204bc6dc0fac5 # timeout=10
[merlin_merlin] $ /bin/bash /tmp/jenkins7813602594619115451.sh
============================= test session starts ==============================
platform linux -- Python 3.8.10, pytest-7.1.2, pluggy-1.0.0
rootdir: /var/jenkins_home/workspace/merlin_merlin/merlin
plugins: anyio-3.5.0, xdist-2.5.0, forked-1.4.0, cov-3.0.0
collected 1 item

tests/unit/test_version.py . [100%]

============================== 1 passed in 0.01s ===============================
Performing Post build task...
Match found for : : True
Logical operation result is TRUE
Running script : #!/bin/bash
cd /var/jenkins_home/
CUDA_VISIBLE_DEVICES=1 python test_res_push.py "https://github.com/gitapi/repos/NVIDIA-Merlin/Merlin/issues/$ghprbPullId/comments" "/var/jenkins_home/jobs/$JOB_NAME/builds/$BUILD_NUMBER/log"
[merlin_merlin] $ /bin/bash /tmp/jenkins9101855656590359336.sh


Successfully merging this pull request may close this issue:

SOK performance drop on nvcr.io/nvidia/tensorflow
4 participants