
(colab notebook) Train DALLE-pytorch on C@H #291

Open
afiaka87 opened this issue Jun 8, 2021 · 11 comments

@afiaka87
Contributor

afiaka87 commented Jun 8, 2021

https://gist.github.com/afiaka87/b29213684a1dd633df20cab49d05209d

If you hit any bugs, please leave a comment below. When in doubt, restart your kernel; that tends to fix most problems.

@johngore123

Hi, I messaged you on Discord but you seemed busy. I have a problem where the notebook gets stuck at 'Time to load sparse_attn op:' no matter what parameters I use. It used to work, but now it hangs for 10+ minutes. Is this a bug or a simple mistake on my part?

@johngore123

And btw I'm valteralfred. @afiaka87

@afiaka87
Contributor Author

afiaka87 commented Jun 9, 2021

And btw I'm valteralfred. @afiaka87

Hey! I've seen this bug before, I think. You need to delete the folder containing the precompiled PyTorch extensions. I want to say it's the /root/.cache/torch_extensions directory, but I'm on mobile and can't check right now.
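A minimal sketch of that cleanup for a notebook cell, assuming the cache lives at the default ~/.cache/torch_extensions (on Colab, which runs as root, that resolves to /root/.cache/torch_extensions; the exact path can vary by PyTorch version):

```python
import os
import shutil

# Assumed default location of the JIT-compiled extension cache;
# adjust the path if your PyTorch version puts it elsewhere.
cache_dir = os.path.expanduser("~/.cache/torch_extensions")

if os.path.isdir(cache_dir):
    shutil.rmtree(cache_dir)  # force the ops to recompile on next import
    print("removed", cache_dir)
else:
    print("no extension cache found at", cache_dir)
```

After clearing it, restart the kernel so the next import rebuilds the extensions from scratch.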

@johngore123

Thanks, I'll try that. If it doesn't work, I'll try something else.

@johngore123

It seems to have fixed itself! Thanks for the help.

@johngore123

I think the cache got cleaned.

@afiaka87
Contributor Author

Anyone coming here from the notebook: I'm not on the Discord as often as I should be. File issues with the notebook here if you can, or I'm not as likely to see them.

I believe the issue here is that PyTorch or DeepSpeed gets stuck trying to compile an extension. When in doubt, restart the kernel on your notebook. You won't lose your instance; it just clears any local state you have. Then you can re-run the cell you were on; there's no need to re-run the setup cells.

@SadRebel1000

SadRebel1000 commented Jul 23, 2021

Hi there,

I'm trying the Colab notebook for the first time. It gets stuck at the installation of NVIDIA Apex. It seems that --disable-pip-version-check doesn't work?

Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/tmp/pip-req-build-ikzk4nf6/setup.py", line 171, in <module> check_cuda_torch_binary_vs_bare_metal(torch.utils.cpp_extension.CUDA_HOME)
File "/tmp/pip-req-build-ikzk4nf6/setup.py", line 106, in check_cuda_torch_binary_vs_bare_metal  "https://github.com/NVIDIA/apex/pull/323#discussion_r287021798.  "
RuntimeError: Cuda extensions are being compiled with a version of Cuda that does not match the version used to compile Pytorch binaries.  Pytorch binaries were compiled with Cuda 10.2.
In some cases, a minor-version mismatch will not cause later errors:  https://github.com/NVIDIA/apex/pull/323#discussion_r287021798.  You can try commenting out this check (at your own risk).
Running setup.py install for apex ... error
ERROR: Command errored out with exit status 1: /usr/bin/python3 -u -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = 
'"'"'/tmp/pip-req-build-ikzk4nf6/setup.py'"'"'; __file__='"'"'/tmp/pip-req-build-ikzk4nf6/setup.py'"'"';f = getattr(tokenize, 
'"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = 
f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' --cpp_ext --cuda_ext install --record 
/tmp/pip-record-q6hys36y/install-record.txt --single-version-externally-managed --compile --install-headers 
/usr/local/include/python3.7/apex Check the logs for full command output.
Exception information:
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/pip/_internal/req/req_install.py", line 825, in install
    req_description=str(self.req),
  File "/usr/local/lib/python3.7/dist-packages/pip/_internal/operations/install/legacy.py", line 81, in install
    raise LegacyInstallFailure
pip._internal.operations.install.legacy.LegacyInstallFailure
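For context, the check Apex is tripping over can be sketched roughly like this (a simplified illustration, not Apex's actual setup.py code): it compares the CUDA version PyTorch was built against with the locally installed CUDA toolkit and refuses to build the extensions on a version mismatch.

```python
# Simplified sketch of the idea behind Apex's
# check_cuda_torch_binary_vs_bare_metal; illustrative names only.
def versions_compatible(torch_cuda: str, bare_metal_cuda: str) -> bool:
    # Compare major.minor of the two CUDA versions.
    t_major, t_minor = torch_cuda.split(".")[:2]
    b_major, b_minor = bare_metal_cuda.split(".")[:2]
    return (t_major, t_minor) == (b_major, b_minor)

# PyTorch binaries built with CUDA 10.2 vs. a newer local toolkit fail:
print(versions_compatible("10.2", "11.0"))  # → False
print(versions_compatible("10.2", "10.2"))  # → True
```

In other words, the error in the traceback above means the Colab runtime's CUDA toolkit no longer matches the CUDA version the installed PyTorch wheel was compiled with, so Apex bails out before compiling.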

@afiaka87
Contributor Author

I updated the Colab notebook recently to train with the Crawling@Home dataset. Hopefully that fixed some of these issues.

@afiaka87 afiaka87 changed the title (colab notebook) Train DALLE-pytorch with DeepSpeed and Automatic Mixed Precision (colab notebook) Train DALLE-pytorch Sep 21, 2021
@afiaka87 afiaka87 changed the title (colab notebook) Train DALLE-pytorch (colab notebook) Train DALLE-pytorch on C@H Sep 21, 2021
@Stomachache007

@afiaka87 Hi, thanks for sharing. I am using the afiaka DALL-E generation Colab: https://colab.research.google.com/drive/11V2xw1eLPfZvzW8UQyTUhqCEU71w6Pr4?usp=sharing#scrollTo=682c5804-5f97-469f-8cf1-1cc8356591b8. I got several bugs I don't know how to fix:
File "/usr/local/lib/python3.7/dist-packages/deepspeed/ops/sparse_attention/sparse_self_attention.py", line 127, in forward
assert query.dtype == torch.half, "sparse attention only supports training in fp16 currently, please file a github issue if you need fp32 support"
AssertionError: sparse attention only supports training in fp16 currently, please file a github issue if you need fp32 support
Finished generating images, attempting to display results...

Also found a related issue here: robvanvolt/DALLE-models#13, but no one has fixed it yet.

@afiaka87
Contributor Author

@afiaka87 Hi, thanks for sharing. I am using the afiaka DALL-E generation Colab: https://colab.research.google.com/drive/11V2xw1eLPfZvzW8UQyTUhqCEU71w6Pr4?usp=sharing#scrollTo=682c5804-5f97-469f-8cf1-1cc8356591b8. I got several bugs I don't know how to fix:
File "/usr/local/lib/python3.7/dist-packages/deepspeed/ops/sparse_attention/sparse_self_attention.py", line 127, in forward
assert query.dtype == torch.half, "sparse attention only supports training in fp16 currently, please file a github issue if you need fp32 support"
AssertionError: sparse attention only supports training in fp16 currently, please file a github issue if you need fp32 support
Finished generating images, attempting to display results...

Also found a related issue here: robvanvolt/DALLE-models#13, but no one has fixed it yet.

This has to do with DeepSpeed dropping support for a lot of GPUs in its sparse attention CUDA code. Regrettably, I don't believe those kernels are likely to work again soon, as I can no longer run them locally either.
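For anyone hitting the fp16 assertion above: the sparse attention kernels only accept half-precision tensors, so (when the kernels load at all) training has to run with fp16 enabled so queries arrive as torch.half. A toy stand-in for the gate, with illustrative names rather than DeepSpeed's real API:

```python
# Toy stand-in for the dtype check in DeepSpeed's
# SparseSelfAttention.forward; names here are illustrative only.
def sparse_attention(query_dtype: str) -> str:
    if query_dtype != "float16":
        raise AssertionError(
            "sparse attention only supports training in fp16 currently"
        )
    return "attention computed"

# Running in fp32 reproduces the error; half precision passes the gate:
try:
    sparse_attention("float32")
except AssertionError as e:
    print("fp32 fails:", e)
print("fp16:", sparse_attention("float16"))
```

In practice that means enabling fp16 in the DeepSpeed/training config (or casting the model and inputs to half precision) rather than running the notebook in fp32.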
