
(colab notebook) Train DALLE-pytorch on C@H #291

Open
afiaka87 opened this issue Jun 8, 2021 · 11 comments

@afiaka87
Contributor

afiaka87 commented Jun 8, 2021

https://gist.github.com/afiaka87/b29213684a1dd633df20cab49d05209d

If you hit any bugs, please leave a comment below. When in doubt, restart your kernel; that tends to fix most problems.

@johngore123

Hi, I messaged you on Discord but you seemed busy. I have a problem where the notebook gets stuck at 'Time to load sparse_attn op:' no matter what parameters I use. It used to work, but now it hangs for 10+ minutes. Is this a bug or a simple mistake on my part?

@johngore123

And btw I'm valteralfred. @afiaka87

@afiaka87
Contributor Author

afiaka87 commented Jun 9, 2021

And btw I'm valteralfred. @afiaka87

Hey! I've seen this bug before, I think. You need to delete the folder containing the precompiled PyTorch extensions. I want to say it's the /root/.cache/torch_extensions directory, but I'm on mobile and can't check right now.
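A minimal sketch of that cleanup for a notebook cell, assuming the cache lives at the default ~/.cache/torch_extensions (on Colab, which runs as root, that resolves to /root/.cache/torch_extensions; the exact path can vary by PyTorch version):

```python
import os
import shutil

# Assumed default location of the JIT-compiled extension cache;
# adjust the path if your PyTorch version puts it elsewhere.
cache_dir = os.path.expanduser("~/.cache/torch_extensions")

if os.path.isdir(cache_dir):
    shutil.rmtree(cache_dir)  # force the ops to recompile on next import
    print("removed", cache_dir)
else:
    print("no extension cache found at", cache_dir)
```

After clearing it, restart the kernel so the next import rebuilds the extensions from scratch.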

@johngore123

Thanks, I'll try that. If it doesn't work, I'll try something else.

@johngore123

It seems to have fixed itself! Thanks for the help.

@johngore123

I think the cache got cleaned.

@afiaka87
Contributor Author

Anyone coming here from the notebook: I'm not on the Discord as often as I should be. File issues with the notebook here if you can, or I'm not as likely to see them.

I believe the issue here is that PyTorch or DeepSpeed gets stuck trying to compile an extension. When in doubt, restart the kernel on your notebook. You won't lose your instance; it just clears any local state you have. Then you can re-run the cell you were on; there's no need to re-run the setup cells.

@SadRebel1000

SadRebel1000 commented Jul 23, 2021

Hi there,

I'm trying the Colab notebook for the first time. It gets stuck at the installation of NVIDIA Apex. It seems that --disable-pip-version-check doesn't work?

Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/tmp/pip-req-build-ikzk4nf6/setup.py", line 171, in <module> check_cuda_torch_binary_vs_bare_metal(torch.utils.cpp_extension.CUDA_HOME)
File "/tmp/pip-req-build-ikzk4nf6/setup.py", line 106, in check_cuda_torch_binary_vs_bare_metal  "https://github.com/NVIDIA/apex/pull/323#discussion_r287021798.  "
RuntimeError: Cuda extensions are being compiled with a version of Cuda that does not match the version used to compile Pytorch binaries.  Pytorch binaries were compiled with Cuda 10.2.
In some cases, a minor-version mismatch will not cause later errors:  https://github.com/NVIDIA/apex/pull/323#discussion_r287021798.  You can try commenting out this check (at your own risk).
Running setup.py install for apex ... error
ERROR: Command errored out with exit status 1: /usr/bin/python3 -u -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = 
'"'"'/tmp/pip-req-build-ikzk4nf6/setup.py'"'"'; __file__='"'"'/tmp/pip-req-build-ikzk4nf6/setup.py'"'"';f = getattr(tokenize, 
'"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = 
f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' --cpp_ext --cuda_ext install --record 
/tmp/pip-record-q6hys36y/install-record.txt --single-version-externally-managed --compile --install-headers 
/usr/local/include/python3.7/apex Check the logs for full command output.
Exception information:
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/pip/_internal/req/req_install.py", line 825, in install
    req_description=str(self.req),
  File "/usr/local/lib/python3.7/dist-packages/pip/_internal/operations/install/legacy.py", line 81, in install
    raise LegacyInstallFailure
pip._internal.operations.install.legacy.LegacyInstallFailure
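For context, the check Apex is tripping over can be sketched roughly like this (a simplified illustration, not Apex's actual setup.py code): it compares the CUDA version PyTorch was built against with the locally installed CUDA toolkit and refuses to build the extensions on a version mismatch.

```python
# Simplified sketch of the idea behind Apex's
# check_cuda_torch_binary_vs_bare_metal; illustrative names only.
def versions_compatible(torch_cuda: str, bare_metal_cuda: str) -> bool:
    # Compare major.minor of the two CUDA versions.
    t_major, t_minor = torch_cuda.split(".")[:2]
    b_major, b_minor = bare_metal_cuda.split(".")[:2]
    return (t_major, t_minor) == (b_major, b_minor)

# PyTorch binaries built with CUDA 10.2 vs. a newer local toolkit fail:
print(versions_compatible("10.2", "11.0"))  # → False
print(versions_compatible("10.2", "10.2"))  # → True
```

In other words, the error in the traceback above means the Colab runtime's CUDA toolkit no longer matches the CUDA version the installed PyTorch wheel was compiled with, so Apex bails out before compiling.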

@afiaka87
Contributor Author

I updated the Colab notebook recently to train with the Crawling@Home dataset. Hopefully that fixed some of these issues.

@afiaka87 afiaka87 changed the title (colab notebook) Train DALLE-pytorch with DeepSpeed and Automatic Mixed Precision (colab notebook) Train DALLE-pytorch Sep 21, 2021
@afiaka87 afiaka87 changed the title (colab notebook) Train DALLE-pytorch (colab notebook) Train DALLE-pytorch on C@H Sep 21, 2021
@Stomachache007

@afiaka87 Hi, thanks for sharing. I am using the afiaka DALL-E generation Colab: https://colab.research.google.com/drive/11V2xw1eLPfZvzW8UQyTUhqCEU71w6Pr4?usp=sharing#scrollTo=682c5804-5f97-469f-8cf1-1cc8356591b8. I got several bugs I don't know how to fix:
File "/usr/local/lib/python3.7/dist-packages/deepspeed/ops/sparse_attention/sparse_self_attention.py", line 127, in forward
assert query.dtype == torch.half, "sparse attention only supports training in fp16 currently, please file a github issue if you need fp32 support"
AssertionError: sparse attention only supports training in fp16 currently, please file a github issue if you need fp32 support
Finished generating images, attempting to display results...

Also found a related issue here: robvanvolt/DALLE-models#13, but no one has fixed it yet.

@afiaka87
Contributor Author

@afiaka87 Hi, thanks for sharing. I am using the afiaka DALL-E generation Colab: https://colab.research.google.com/drive/11V2xw1eLPfZvzW8UQyTUhqCEU71w6Pr4?usp=sharing#scrollTo=682c5804-5f97-469f-8cf1-1cc8356591b8. I got several bugs I don't know how to fix:
File "/usr/local/lib/python3.7/dist-packages/deepspeed/ops/sparse_attention/sparse_self_attention.py", line 127, in forward
assert query.dtype == torch.half, "sparse attention only supports training in fp16 currently, please file a github issue if you need fp32 support"
AssertionError: sparse attention only supports training in fp16 currently, please file a github issue if you need fp32 support
Finished generating images, attempting to display results...

Also found a related issue here: robvanvolt/DALLE-models#13, but no one has fixed it yet.

This has to do with DeepSpeed dropping support for a lot of GPUs in its sparse attention CUDA code. Regrettably, I don't believe those kernels are likely to work again soon, as I can no longer run them locally either.
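For anyone hitting the fp16 assertion above: the sparse attention kernels only accept half-precision tensors, so (when the kernels load at all) training has to run with fp16 enabled so queries arrive as torch.half. A toy stand-in for the gate, with illustrative names rather than DeepSpeed's real API:

```python
# Toy stand-in for the dtype check in DeepSpeed's
# SparseSelfAttention.forward; names here are illustrative only.
def sparse_attention(query_dtype: str) -> str:
    if query_dtype != "float16":
        raise AssertionError(
            "sparse attention only supports training in fp16 currently"
        )
    return "attention computed"

# Running in fp32 reproduces the error; half precision passes the gate:
try:
    sparse_attention("float32")
except AssertionError as e:
    print("fp32 fails:", e)
print("fp16:", sparse_attention("float16"))
```

In practice that means enabling fp16 in the DeepSpeed/training config (or casting the model and inputs to half precision) rather than running the notebook in fp32.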
