
RuntimeError and speed loss with opt_level = O1, O2 or O3 #373

Open
adrienchaton opened this issue Jun 23, 2019 · 28 comments

@adrienchaton

Hello,

I discovered your apex tools for integrating mixed-precision and FP16 training in pytorch, which is a great idea to develop! Our servers are mainly equipped with TITAN V cards, so I was really looking forward to trying them out at their fastest. Software versions are pytorch 1.1.0, cuda 9.1.85 and cudnn 7.1.3. As I had never tried this before, I used the more straightforward apex.amp to compare FP32 training with opt_level = O1, O2 or O3 (the last with keep_batchnorm_fp32=True, as my models use batchnorm).

It is a rather large codebase, so I only report my main questions/issues here; if needed I can provide more details and try to put together reproducible cases.

#1 The training script I wanted to run with amp uses torch.stft to compute spectral losses.
When computing these losses, I get
RuntimeError: arange_out not supported on CPUType for Half
which points to the stft operation. Is it correct that a script should avoid spectral operations such as torch.stft if it is to be optimized in mixed precision? Or is there a fix/workaround for that, please?
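
A possible workaround I am considering (just a sketch, not verified on my full setup) is to keep the STFT itself in FP32, creating the window explicitly in float32 on the GPU so that no Half operation lands on the CPU:

import torch

def stft_fp32(signal, n_fft=1024, hop=256):
    # the signal may arrive as FP16 under O1/O2, so cast it up for the STFT only
    window = torch.hann_window(n_fft, dtype=torch.float32, device=signal.device)
    return torch.stft(signal.float(), n_fft, hop_length=hop, window=window)

# example: a dummy half-precision batch of waveforms on the GPU
spec = stft_fp32(torch.randn(4, 16384, device="cuda").half())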

#2 I tried to run the comparison optimizing only time-domain losses (e.g. waveform MSE instead of spectral reconstruction) so that the code runs without error for every opt_level, but then opt_level = O1, O2 and O3 were all slower than opt_level = O0 (and than my original FP32 training). Obviously I expected the speed gain to depend on the code, the operations involved, the batch sizes etc., but I did not expect a slowdown.
For this I only used amp.initialize and amp.scale_loss (as recommended in the first example of https://nvidia.github.io/apex/amp.html); a stripped-down version is sketched just below. I train generative models composed mainly of conv1d, batchnorm1d and linear layers. Everything is feed-forward, no softmax or classification. What could I check to understand whether I can hope for a speed gain in my application case, please?
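
For reference, the integration boils down to the documented two-step pattern; here is a self-contained toy version of what my script does (dummy model and data, not my actual network):

import torch
import torch.nn as nn
import torch.nn.functional as F
from apex import amp

# tiny stand-in for my conv1d/batchnorm1d/linear model, just to show where the amp calls go
model = nn.Sequential(nn.Conv1d(1, 8, 9, padding=4), nn.BatchNorm1d(8),
                      nn.ReLU(), nn.Conv1d(8, 1, 9, padding=4)).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

x = torch.randn(16, 1, 4096, device="cuda")          # dummy waveform batch
for _ in range(10):
    loss = F.mse_loss(model(x), x)                    # waveform MSE, as in my comparison
    optimizer.zero_grad()
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()
    optimizer.step()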

Good luck developing mixed-precision training; it has a lot of potential if it becomes better integrated into existing tools!

@ptrblck
Contributor

ptrblck commented Jun 23, 2019

Hi @adrienchaton.

The issue in your first point seems to involve a CPU operation.
Are you manually pushing some activations back to the CPU to compute torch.stft?
Currently not all operations are implemented for FP16 tensors in PyTorch, so this might be what yields the error.

Could you give us some information on the shapes you are using for the operations?
Note that GEMMs need sizes that are multiples of 8, as explained here by @mcarilli; a trivial helper for picking such sizes is sketched below.
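
For example, a small helper to pick Tensor-Core-friendly sizes could look like this (just a sketch):

def round_up_to_multiple(n, base=8):
    # round a channel count or feature size up to the next multiple of `base`
    return ((n + base - 1) // base) * base

assert round_up_to_multiple(100) == 104
assert round_up_to_multiple(64) == 64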

A code snippet to debug and profile your use case would be helpful. :)

@adrienchaton
Author

adrienchaton commented Jun 24, 2019

Hi @ptrblck

As described in the pytorch discussion, there was some slightly odd behavior when calling torch.hann_window.
It now seems fixed with a couple of modifications to how the data is cast, and the training/evaluation epochs run all the way through.

I was already setting torch.backends.cudnn.benchmark=True in my code before getting to know apex.amp, but I didn't know about the specific shape considerations explained by @mcarilli.

I modified the architecture parameters to get closer to the GEMM specifications: all conv1d channel counts are now multiples of 8, except the single input channel of the first convolution and the single output channel of the last convolution (fixed by the single-channel signals I process), and every linear input/output size is now a multiple of 8. In the speed comparison I ran, the model's layers are all conv1d and linear, plus some non-linear activations and 1d batch-norms.

About making the batch size a multiple of 8, it gets more complicated for my use case. The model trains on signals of variable length, which are therefore sliced into a variable number of sub-elements. I shuffle the training samples as signals, not as slices of signals, and then build the mini-batch from the slices of the signals selected for that mini-batch.

This means that the minibatches are of variable size, which may not be a multiple of 8. For the same reason I do not use the pytorch dataloader, because I cannot shape my training/test sets into a single set of tensors; consequently I do not use workers and pin_memory when iterating over the minibatches.
I guess this limits the speed improvement I can expect from optimizing my code.

However, I ran a series of 10 epochs of the same training with the different opt_levels, which I call APEX_O0/1/2/3 (O3 with keep_batchnorm_fp32=True), and got the following:

APEX_O0 - EPOCH #10 out of 10 with SC,logSC,wave,reg == 0.0 0.0 0.1 1.0 and SC_mode= MSE ; current itr= 207
averaged training losses [0. 0. 0.02565329 0.00043002]
averaged testing losses [0. 0. 0.01890221 0.00147927]
tr/val_loss and lr_updated [0.02565329 0.01890221 0.0002 0. ]
elapsed time = 0:01:22

APEX_O1 - EPOCH #10 out of 10 with SC,logSC,wave,reg == 0.0 0.0 0.1 1.0 and SC_mode= MSE ; current itr= 207
averaged training losses [0. 0. 0.02568458 0.00045675]
averaged testing losses [0. 0. 0.01864722 0.0009563 ]
tr/val_loss and lr_updated [0.02568458 0.01864722 0.0002 0. ]
elapsed time = 0:01:28

APEX_O2 - EPOCH #10 out of 10 with SC,logSC,wave,reg == 0.0 0.0 0.1 1.0 and SC_mode= MSE ; current itr= 207
averaged training losses [0. 0. 0.00219011 0.00055197]
averaged testing losses [0. 0. 0.00132275 0.00097656]
tr/val_loss and lr_updated [0.00219011 0.00132275 0.0002 0. ]
elapsed time = 0:03:20

APEX_O3 - EPOCH #10 out of 10 with SC,logSC,wave,reg == 0.0 0.0 0.1 1.0 and SC_mode= MSE ; current itr= 207
averaged training losses [ 0. 0. nan nan]
averaged testing losses [ 0. 0. nan nan]
tr/val_loss and lr_updated [ nan nan 0.0002 0. ]
elapsed time = 0:03:22

According to this, even with the updated parameters, both mixed-precision modes are still slower on average, and pure FP16 does not optimize stably, though that was perhaps to be expected.

Here is some pseudo-code of what happens during training:

import numpy as np
import torch
from apex import amp

# init of amp
if opt_lvl == "O3":
    model, optimizer = amp.initialize(model, optimizer, opt_level=opt_lvl, keep_batchnorm_fp32=True)
else:
    model, optimizer = amp.initialize(model, optimizer, opt_level=opt_lvl)

# dtype flag passed to my own functions so they can cast intermediate tensors
if opt_lvl != "O0":
    dtype = torch.float16
else:
    dtype = None

# ...

# start of a training epoch
train_id = np.arange(N_train)
np.random.shuffle(train_id)
ep_notes = np.array_split(train_id,int(N_train/mb_size_notes))
# here we have indexes of training signals shuffled and split into mini-batches
for mb_note in ep_notes:
    notes_in = [] # a minibatch of notes
    for note_id in mb_note:
        notes_in.append([train_notes[note_id][0].to(device,non_blocking=True),\
                         train_notes[note_id][1],train_notes[note_id][2]])
    # train_notes[note_id][1] is the number of slices to make / train_notes[note_id][2] are some corresponding labels
    mb_slices,mb_z,mb_gen = model.AE(notes_in,block_pp=True,itr=itr,dtype=dtype)
    # the AE function slices the signals into the desired number of elements
    # minibatch them as mb_slices and auto-encodes into mb_gen
    mmd_dist = model.MMD_regularization(mb_z,dtype=dtype)
    # computes the regularization with the model prior
    rec_loss = model.slice_rec(mb_slices,mb_gen)
    # both time and spectral domain reconstruction losses
    loss = rec_loss+mmd_dist
    optimizer.zero_grad()
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()
    optimizer.step()

A main complication is that the signals train_notes[note_id][0] are of variable length, so I cannot fit them into pytorch's dataloader. Maybe there are some tricks for that I don't know about? Or some recommended ways to handle training elements of variable size efficiently? (A rough sketch of what I imagine is at the end of this comment.)

thanks for your time and reading !

PS: afterwards I also need to take mb_slices and mb_gen, together with the variable number of slices per signal, and reassemble the individual input signals from mb_slices (which were the train_notes[note_id][0]) along with their corresponding reconstructions from mb_gen.
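
For reference, here is a rough sketch of the kind of dataloader trick I imagine could work (untested, names hypothetical): wrap the variable-length notes in a Dataset and use a collate_fn that simply keeps the batch as a list, so workers and pin_memory still apply:

import random
import torch
from torch.utils.data import Dataset, DataLoader

# dummy stand-in for my variable-length signals: (signal, number of slices, label)
train_notes = [(torch.randn(random.randint(4000, 16000)), 3, 0) for _ in range(20)]

class NoteDataset(Dataset):
    def __init__(self, notes):
        self.notes = notes
    def __len__(self):
        return len(self.notes)
    def __getitem__(self, idx):
        return self.notes[idx]

def list_collate(batch):
    # keep the variable-length signals as a list instead of stacking one tensor
    return batch

loader = DataLoader(NoteDataset(train_notes), batch_size=4, shuffle=True,
                    num_workers=2, pin_memory=True, collate_fn=list_collate)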

@adrienchaton
Author

adrienchaton commented Jun 24, 2019

In addition, I get some more warnings/errors which could perhaps explain why the amp optimization is not working properly.

After the optimization is set up, I get the following warning:
Warning: multi_tensor_applier fused unscale kernel is unavailable, possibly because apex was installed without --cuda_ext --cpp_ext. Using Python fallback. Original ImportError was: ImportError("No module named 'amp_C'",)

However, I ran the install as recommended:
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext"

The install completes, but with an intermediate error (which I had not spotted before; I tried re-installing):
ERROR: You must give at least one requirement to install (see "pip help install")

@ngimel
Contributor

ngimel commented Jun 24, 2019

You are missing the . at the end of the command line; it's not punctuation that can be omitted, it actually tells pip what to install.

@adrienchaton
Author

@ngimel thank you for pointing this out. The first time I read it I only saw . and not ./, so I didn't get the meaning. I tried the install command with ./ at the end; it runs to the end but also raises an error:

ERROR: Command "/fast-2/adrien/virtualenv_p3/bin/python3 -u -c 'import setuptools, tokenize;file='"'"'/tmp/pip-req-build-rzct3jf4/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(file);code=f.read().replace('"'"'\r\n'"'"', '"'"
'\n'"'"');f.close();exec(compile(code, file, '"'"'exec'"'"'))' --cpp_ext --cuda_ext install --record /tmp/pip-record-lukq75f9/install-record.txt --single-version-externally-managed --compile --install-headers /fast-2/adrien/virtualenv_p3/
include/site/python3.5/apex" failed with error code 1 in /tmp/pip-req-build-rzct3jf4/

and running the code with apex.amp again still gives the warning
Warning: multi_tensor_applier fused unscale kernel is unavailable, possibly because apex was installed without --cuda_ext --cpp_ext. Using Python fallback. Original ImportError was: ImportError("No module named 'amp_C'",)

I can hardly make sense of the install error; do you have any idea what might be going on, please?

@mcarilli
Contributor

Can you uninstall first before reinstalling, just to be sure?

$ pip uninstall apex
$ pip uninstall apex # just to be sure
$ cd apex_repo_dir
$ pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./

In general it's hard to predict what exactly will be your network's bottleneck. Some potentially useful points:

  • Having the C++ backend enabled (i.e. a successful compilation that no longer says "Warning: ...possibly because apex was installed without --cuda_ext --cpp_ext") may give a good performance boost for mixed precision; a quick check for this is sketched after this list.
  • torch.backends.cudnn.benchmark=True forces cudnn to rerun its internal experiments to find the fastest algorithm whenever a convolution with new size parameters is encountered. These experiments are expensive. Therefore, in your use case, where input sizes vary, it may be better to run with torch.backends.cudnn.benchmark=False, in which case cudnn will make a quick best guess of what algorithm to run. torch.backends.cudnn.benchmark=False will not guarantee that the absolute-fastest algorithm will be selected, but the selection process itself will become quick, which may be an overall performance win.
  • If you really want to understand your network's performance, there's no substitute for visual profiling. I've prepared a profiling guide based on apex/examples/imagenet; also see Dose data_prefetcher() really speed up training? #304 (comment).
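
As a quick sanity check for the first point (a small snippet, not part of the profiling guide), you can try importing the compiled extension directly; if it fails, amp is running on the Python fallback path:

try:
    import amp_C   # only present when apex was built with --cpp_ext --cuda_ext
    print("apex fused kernels available")
except ImportError as err:
    print("Python-only apex install:", err)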

@adrienchaton
Author

adrienchaton commented Jun 24, 2019

@mcarilli Thank you for your insights !

When I pip installed apex into my virtualenv, it did not add it to the environment's path, so at first I just imported apex like this:

import sys
sys.path.append('the_apex_dir')
from apex import amp, optimizers

which means the pip uninstall apex command does not find apex installed. I manually deleted the directory where the installation was and tried re-cloning/installing, but got the same error during installation; it continues and finishes, but the result seems restricted to the Python-only build (the servers I am using are set up with cuda 9.1.85 and cudnn 7.1.3).

My step for tomorrow was to cProfile the FP32 code to see what I could improve, probably a lot since the data handling is a bit 'off the beaten track' compared to the usual, efficient pytorch dataloaders (roughly as sketched below). I will do that, but if you have any ideas on how to fix the apex installation I would be very interested in testing apex.amp with the C++ backend; maybe with a bit of prior code improvement it could all-in-all get much faster!
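
Roughly, the profiling pass I have in mind would be something like this (train_one_epoch is just a placeholder for my real epoch function):

import cProfile
import pstats
import time

def train_one_epoch():
    # placeholder standing in for my real FP32 epoch loop
    time.sleep(0.1)

prof = cProfile.Profile()
prof.enable()
train_one_epoch()
prof.disable()
prof.dump_stats("fp32_profile.prof")
pstats.Stats("fp32_profile.prof").sort_stats("cumulative").print_stats(30)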

Ultimately, if the C++ backend can be installed, would it then make sense to follow your apex profiling guide to try to gain even more speed from the mixed-precision or FP16 versions of the code?

And thanks for pointing out that torch.backends.cudnn.benchmark=True is not always the right choice. I thought that in the long run it always was (that over a 40-60h training the benchmarking is always beneficial), but with this variable minibatch size maybe it isn't? I am unsure what cudnn considers new size parameters; the convolutions are unchanged throughout training, same channels in/out, same kernels, etc.

@mcarilli
Contributor

mcarilli commented Jun 24, 2019

Input (data) sizes do count as new size parameters from the perspective of convolutions, so I expect your case will benefit from torch.backends.cudnn.benchmark=False.

You should not simply import apex from the cloned repo directory itself. There is a danger of this happening by mistake, if the apex repo directory is a top-level subdirectory in the folder where you are running your script, because the current directory will be on your PYTHONPATH.

Apex must be imported from wherever it is installed on your system. You can check how your script attempted to import Apex by including the following lines in your script:

import sys
import apex
print("Imported apex from ", sys.modules['apex'])

The print statement should show a path to one of your environment's Python library directories, NOT the path to the cloned repo.

@adrienchaton
Author

@mcarilli Thank you for the clarifications about torch.backends.cudnn.benchmark; I will try disabling it in this case.

Regarding the installation, running an additional python setup.py install after the pip install creates apex in python3.5/site-packages as apex-0.1-py3.5.egg; then I can import apex from site-packages instead of pointing to the cloned repo. However, with this import from the python environment, I still get the warning Warning: multi_tensor_applier fused unscale kernel is unavailable ....

Now I can pip uninstall apex. I tried cloning/reinstalling everything, but it did not fix the import; the C++ backend still seems to be missing, as the warning says.

And if I try python setup.py install --cpp_ext --cuda_ext then I get the following error:

torch.version = 1.1.0

Compiling cuda extensions with
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Fri_Nov__3_21:07:56_CDT_2017
Cuda compilation tools, release 9.1, V9.1.85
from /usr/local/cuda/bin

Traceback (most recent call last):
File "setup.py", line 64, in
check_cuda_torch_binary_vs_bare_metal(torch.utils.cpp_extension.CUDA_HOME)
File "setup.py", line 54, in check_cuda_torch_binary_vs_bare_metal
"#323 (comment). "
RuntimeError: Cuda extensions are being compiled with a version of Cuda that does not match the version used to compile Pytorch binaries. Pytorch binaries were compiled with Cuda 9.0.176.
In some cases, a minor-version mismatch will not cause later errors: #323 (comment). You can try commenting out this check (at your own risk).

Could this be the reason why the backend doesn't work right? Should I reinstall pytorch first? Or try updating cudnn?
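
For reference, this is the version check I am doing on the Python side (small snippet; the printed values correspond to my setup):

import torch

print("torch:", torch.__version__)                     # 1.1.0
print("torch built with CUDA:", torch.version.cuda)    # 9.0.176 for the pip wheel
print("cuDNN:", torch.backends.cudnn.version())
# while nvcc --version on the system reports 9.1.85, hence the mismatch apex complains about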

@adrienchaton
Author

adrienchaton commented Jun 25, 2019

I also tried running python setup.py install --cpp_ext --cuda_ext with the following line commented out:
check_cuda_torch_binary_vs_bare_metal(torch.utils.cpp_extension.CUDA_HOME)

This seems to allow the installation, with only one warning:
cc1plus: warning: command line option '-Wstrict-prototypes' is valid for C/ObjC but not for C++

But when initializing amp I still get the warning that the C++ backend is not working:
Warning: multi_tensor_applier fused unscale kernel is unavailable ...

I guess I need to build pytorch from source with a matching cuda; not
Pytorch binaries were compiled with Cuda 9.0.176
but 9.1.85, matching my CUDA setup.

let's go !

@adrienchaton
Author

adrienchaton commented Jun 25, 2019

I am having issues with installing from source; could anyone help, please? @ptrblck @ngimel @mcarilli

I already installed pytorch from source on my local OSX laptop without issues and it runs great with a thunderbolt eGPU, NVIDIA webdrivers, cuda and cudnn.

Our servers are Linux and we do not use conda.
I made sure to have all dependencies installed and up to date
pip install numpy pyyaml mkl mkl-include setuptools cmake cffi typing

Then I clone
git clone --recursive https://github.com/pytorch/pytorch
and try installing
python setup.py install
(edit: I was actually already running python setup.py install --cpp_ext --cuda_ext here)
I also tried updating the submodules:
git submodule sync ; git submodule update --init --recursive

But in every case I end up with the following error:

-- Found cuDNN: v7.1.3 (include: /usr/local/cuda/include, library: /usr/local/cuda/lib64/libcudnn.so.7)
CMake Error at cmake/public/cuda.cmake:340 (message):
CUDA 9.1 is not compatible with std::tuple from GCC version >= 6. Please
upgrade to CUDA 9.2 or set the following environment variable to use
another version (for example):

export CUDAHOSTCXX='/usr/bin/gcc-5'

Call Stack (most recent call first):
cmake/Dependencies.cmake:808 (include)
CMakeLists.txt:270 (include)

-- Configuring incomplete, errors occurred!
See also "/fast-2/adrien/pytorch/build/CMakeFiles/CMakeOutput.log".
See also "/fast-2/adrien/pytorch/build/CMakeFiles/CMakeError.log".
Traceback (most recent call last):
File "setup.py", line 754, in
build_deps()
File "setup.py", line 327, in build_deps
cmake=cmake)
File "/fast-2/adrien/pytorch/tools/build_pytorch_libs.py", line 61, in build_caffe2
rerun_cmake)
File "/fast-2/adrien/pytorch/tools/setup_helpers/cmake.py", line 355, in generate
self.run(args, env=my_env)
File "/fast-2/adrien/pytorch/tools/setup_helpers/cmake.py", line 110, in run
check_call(command, cwd=self.build_dir, env=env)
File "/usr/lib/python3.5/subprocess.py", line 271, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['cmake', '-GNinja', '-DBUILD_PYTHON=True', '-DBUILD_TEST=True', '-DCAFFE2_STATIC_LINK_CUDA=False', '-DCMAKE_BUILD_TYPE=Release', '-DCMAKE_CXX_FLAGS= ', '-DCMAKE_C_FLAGS= ', '-DCMAKE_EXE_LINKER_FLAGS=', '-DCMAKE_INSTALL_PREFIX=/fast-2/adrien/pytorch/torch', '-DCMAKE_PREFIX_PATH=/../', '-DCMAKE_SHARED_LINKER_FLAGS=', '-DINSTALL_TEST=True', '-DNAMEDTENSOR_ENABLED=False', '-DNCCL_EXTERNAL=True', '-DNUMPY_INCLUDE_DIR=/fast-2/adrien/virtualenv_p3/lib/python3.5/site-packages/numpy/core/include', '-DPYTHON_EXECUTABLE=/fast-2/adrien/virtualenv_p3/bin/python', '-DPYTHON_INCLUDE_DIR=/usr/include/python3.5m', '-DPYTHON_LIBRARY=/usr/lib/libpython3.5m.so.1.0', '-DTHD_SO_VERSION=1', '-DTORCH_BUILD_VERSION=1.2.0a0+b61693c', '-DUSE_CUDA=True', '-DUSE_DISTRIBUTED=True', '-DUSE_FBGEMM=True', '-DUSE_MKLDNN=True', '-DUSE_NCCL=True', '-DUSE_NNPACK=True', '-DUSE_NUMPY=True', '-DUSE_QNNPACK=True', '-DUSE_ROCM=False', '-DUSE_SYSTEM_EIGEN_INSTALL=OFF', '-DUSE_SYSTEM_NCCL=False', '-DMKLDNN_ENABLE_CONCURRENT_EXEC=ON', '/fast-2/adrien/pytorch']' returned non-zero exit status 1

I tried putting export CUDAHOSTCXX='/usr/bin/gcc-5' first, but the install/build still gives the same error. Is there no other way than updating CUDA to 9.2?

thanks !

@mcarilli
Contributor

mcarilli commented Jun 25, 2019

python setup.py install will perform a Python-only install, so it's not surprising you see the warnings that apex_C is unavailable.

I don't think you need to build Pytorch from source on your servers; it will probably be fine to use the existing Pytorch installation. When you installed after commenting out those lines of the Apex setup.py, it's possible you simply had a conflicting Python-only install hanging around somewhere.
Keep https://github.com/NVIDIA/apex/blob/master/setup.py#L49-L55 commented out, but make sure to remove all potentially conflicting installs:

pip uninstall apex
pip uninstall apex # until it says Apex isn't installed...
python setup.py install --cuda_ext --cpp_ext

Running in a Docker container can also help avoid environment issues, since we are able to explicitly test the install in containers.

@adrienchaton
Author

Sorry, in between the posts I forgot to write the flags, but it was already python setup.py install with --cuda_ext --cpp_ext.

I pip uninstalled apex fully, and I also reinstalled pytorch with pip (after uninstalling pytorch and torchvision),
not using the source build; it comes with CUDA 9.0.176 (which doesn't match my native cuda 9.1.85, but maybe that's ok).

Then I cloned apex again and commented out lines 49-55, which is the torch cuda vs. bare-metal check.

Inside the clone directory I run
python setup.py install --cpp_ext --cuda_ext
and it throws a lot of errors.

Tomorrow afternoon we will stop one server and update cuda before trying again with the setup check commented out. The error report is quite huge now; I will let you know how it goes tomorrow. We want to put one test server on cuda 10 and cuDNN>=7.3 and try mixed precision on it.

@adrienchaton
Author

Before we update this machine, I tried running the install again from scratch, making sure that everything apex-related was pip uninstalled first, the directories deleted and cloned again, and lines 49-55 of setup.py commented out.

Running python setup.py install --cpp_ext --cuda_ext gives the following errors and does not complete the installation.
error apex python setup.txt

Running pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./ with the same lines commented out also gives the following errors but finishes the installation; however, when running apex after this installation, it still gives the warning that the backend is not available.
error apex pip install.txt

I don't know if it is of any help, but just in case, here are the error reports.

@ptrblck
Contributor

ptrblck commented Jun 26, 2019

Did you have a chance to try installing apex in a docker container as suggested by @mcarilli?

The first log file seems to point to a GCC version error, while the second log file doesn't seem to indicate any errors.

@adrienchaton
Author

I already had a GCC error when trying to build pytorch from source on the server (to get matching CUDA versions), which said I should either upgrade to CUDA>=9.2 or use gcc-5 instead of 6 (but the server doesn't have gcc-5 and the IT guys said we'd rather upgrade CUDA).

I looked into the docker instructions; I have never used docker and have not opened an NGC account yet.
I can try it, but I am not sure I understand how it works.

I do not already have any docker container, so option 2 is not for me? Or is it rather what you would recommend?

And about option 1, the Dockerfile installs the latest Apex on top of an existing image.
Things are already a bit convoluted, so I am not sure what the image refers to.
Will it install a new pytorch along with the apex tools?
Or should I point BASE_IMAGE to my existing pytorch and cuda libraries?
And for running the container, is it something like a virtualenv? Should I run it inside my current virtualenv as an extension of it, or is it like a new standalone virtualenv?

@ptrblck
Contributor

ptrblck commented Jun 26, 2019

Image refers to the base container, which would be the pytorch/pytorch:nightly-devel-cuda10.0-cudnn7 one.
This base image will be downloaded, so you don't have to point to your existing installs.

The container will isolate the pytorch and apex installation, and provide a clean and fresh Ubuntu inside of it.
You have to install NVIDIA drivers for your GPUs, but besides that the docker container will ship with everything else (CUDA, cudnn, gcc, etc.).

You don't have to run the docker container inside a virtual environment.

@adrienchaton
Author

Thank you @ptrblck for the explanations.

So if I build that docker container, then inside it I will install my own drivers and also the libraries I need (for instance the ones I pip installed inside my virtualenv), as if I had a new machine?
It makes sense, as containers compare to virtual machines rather than virtual environments.

That could be an optimal solution for trying out apex without needing to change the current servers' main systems. I am proposing this to the IT service; we will see if it's possible, thanks!

@ptrblck
Contributor

ptrblck commented Jun 26, 2019

You would have to install the drivers on your bare metal (not inside the container).

If you follow the docker install guide from @mcarilli, your container will already come with a working PyTorch, nvcc, CUDA, cudnn etc.
Of course you might install additional libraries, e.g. apex in this case.

@mcarilli
Contributor

You might not need to reinstall Cuda, or the bare-metal driver. It might just be a matter of picking the right container. For example, if you have cuda 9.2 installed on your servers, you can use a pytorch 1.0, cuda 9.2 devel container from https://hub.docker.com/r/pytorch/pytorch/tags.

$ nvidia-smi will tell you the driver version.

@adrienchaton
Author

Finally it's importing correctly!
So we now have CUDA 10.1, CUDNN 7.6 and pytorch 1.1.0 (pip install, so it uses torch.version.cuda=='10.0.130').

When I try to run with the default settings of opt_level O1 or O2, very early in the first epoch I get the following:

File "/fast-2/adrien/virtualenv_p3/lib/python3.5/site-packages/apex/amp/scaler.py", line 193, in update_scale
self._has_overflow = self._overflow_buf.item()
RuntimeError: CUDA error: an illegal memory access was encountered

I guess it needs some practice to be used well. I also looked for this RuntimeError in the issues.

In this case, the code runs on a single GPU (titan v) and is optimized with adam.
On top of that I use
model, optimizer = amp.initialize(model,optimizer,opt_level=opt_lvl)
and
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()

If I run with opt_level O0 there are no issues. About this, I am considering something:

what about running a few initial epochs in pure FP32 and, once the model is better initialized, continuing the training with O1 or O2 to train faster? Basically, the initial loss gradients of my model are quite large and variable (init dependent), but after a couple of epochs they get smoother and are probably less prone to extra instabilities from the mixed precision.

Is there anything else I could try or check, please?
Maybe it's a weird idea; I am just trying it now to see, but if you have more recommendations I am happy to try them, since the installation finally seems to be worked out!
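
One more idea I might test (just a guess, based on the documented amp.initialize options): use a fixed loss scale instead of the dynamic scaler, to rule out early-training scale instabilities. A minimal sketch with a dummy model:

import torch
import torch.nn as nn
from apex import amp

model = nn.Linear(8, 8).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)
# loss_scale accepts a fixed float instead of the default "dynamic"
model, optimizer = amp.initialize(model, optimizer, opt_level="O1", loss_scale=128.0)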

@mcarilli
Contributor

How early in the first epoch does this occur? Immediately (on the first iteration) or after several dozen iterations? Immediately would imply some functionality bug. After several dozen iterations would imply some numerical issue.

@adrienchaton
Author

adrienchaton commented Jun 27, 2019

first iteration, at the scaled_loss.backward()

Selected optimization level O1: Insert automatic casts around Pytorch functions and Tensor methods.

Defaults for this optimization level are:
keep_batchnorm_fp32 : None
patch_torch_functions : True
enabled : True
cast_model_type : None
loss_scale : dynamic
opt_level : O1
master_weights : None
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
keep_batchnorm_fp32 : None
patch_torch_functions : True
enabled : True
cast_model_type : None
loss_scale : dynamic
opt_level : O1
master_weights : None

Then there is no warning, but maybe still some bug? What could I check, please?

It does the same with the default opt_level O2,
but not with opt_level O3 with keep_batchnorm_fp32=True.

@adrienchaton
Author

adrienchaton commented Jun 27, 2019

And if I run the same with opt_level O0 and O3 (keep_batchnorm_fp32=True), it ends up as:

*** APEX_O0 - EPOCH #10 out of 10 with SC,logSC,wave,reg == 0.0 0.0 0.1 1.0 and SC_mode= MSE ; current itr= 207
averaged training losses [0. 0. 0.03716807 0.00047766]
averaged testing losses [0. 0. 0.0299097 0.00211793]
tr/val_loss and lr_updated [0.03716807 0.0299097 0.0002 0. ]
elapsed time = 0:01:20

*** APEX_O3 - EPOCH #10 out of 10 with SC,logSC,wave,reg == 0.0 0.0 0.1 1.0 and SC_mode= MSE ; current itr= 207
averaged training losses [ 0. 0. nan nan]
averaged testing losses [ 0. 0. nan nan]
tr/val_loss and lr_updated [ nan nan 0.0002 0. ]
elapsed time = 0:02:21

So O3 runs and produces some NaNs, but that is more or less expected.
However, it is slower than O0, which is not expected.

@ngimel
Contributor

ngimel commented Jun 27, 2019

For the illegal memory access issue, can you please run with the environment variable CUDA_LAUNCH_BLOCKING set to 1 (CUDA_LAUNCH_BLOCKING=1 python my_script.py ...)? This will probably give a more informative error message.
For the performance issue, can you either provide a minimal script that reproduces the slower performance with O3 (it can run on synthetic data), or collect profiles with nvprof (nvprof -o prof1.nvvp python my_script.py ...) for the fp32 and fp16 runs and share them?

@adrienchaton
Author

adrienchaton commented Jun 27, 2019

I re-used https://github.com/pytorch/examples/mnist
and modified main.py into fp16_main.py, just adding the amp initialization and the scaled-loss backward.

This one runs on every opt_level (for O3 I just use the defaults; the example doesn't use batchnorm anyway).

I ran time python fp16_main.py >/dev/null for each opt_level:

O0 ends with
real 2m18,085s
user 12m6,164s
sys 0m12,860s

O1 ends with
real 2m46,045s
user 12m49,328s
sys 0m15,332s

O2 ends with
real 2m15,303s
user 12m7,332s
sys 0m13,848s

O3 ends with
real 2m26,401s
user 12m13,896s
sys 0m14,164s

fp16_main.py.txt

System-wise it only seems to get slower, but at least it runs without the CUDA error.
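
As an aside, time measures wall-clock including data loading and Python overhead; to time only the GPU part I would probably need to synchronize before reading the clock, along these lines (run_epoch is a placeholder):

import time
import torch

def run_epoch():
    # placeholder for one training epoch
    pass

def timed(fn):
    torch.cuda.synchronize()          # CUDA kernels launch asynchronously
    start = time.perf_counter()
    fn()
    torch.cuda.synchronize()
    return time.perf_counter() - start

print("epoch took %.3f s" % timed(run_epoch))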

Below I give you the report with CUDA_LAUNCH_BLOCKING=1; thanks for the advice and help!

start iteration 0
Traceback (most recent call last):
File "float16_pptrainbis_WAE_MCNN_01.py", line 559, in
scaled_loss.backward()
File "/usr/lib/python3.5/contextlib.py", line 66, in exit
next(self.gen)
File "/fast-2/adrien/virtualenv_p3/lib/python3.5/site-packages/apex/amp/handle.py", line 127, in scale_loss
optimizer._post_amp_backward(loss_scaler)
File "/fast-2/adrien/virtualenv_p3/lib/python3.5/site-packages/apex/amp/_process_optimizer.py", line 229, in post_backward_no_master_weights
models_are_masters=True)
File "/fast-2/adrien/virtualenv_p3/lib/python3.5/site-packages/apex/amp/scaler.py", line 116, in unscale
1./scale)
File "/fast-2/adrien/virtualenv_p3/lib/python3.5/site-packages/apex/multi_tensor_apply/multi_tensor_apply.py", line 30, in call
*args)
RuntimeError: CUDA error: an illegal memory access was encountered (multi_tensor_apply at csrc/multi_tensor_apply.cuh:103)
frame #0: std::function<std::string ()>::operator()() const + 0x11 (0x7f2f72f63441 in /fast-2/adrien/virtualenv_p3/lib/python3.5/site-packages/torch/lib/libc10.so)
frame #1: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x2a (0x7f2f72f62d7a in /fast-2/adrien/virtualenv_p3/lib/python3.5/site-packages/torch/lib/libc10.so)
frame #2: void multi_tensor_apply<2, ScaleFunctor<float, float>, float>(int, int, at::Tensor const&, std::vector<std::vector<at::Tensor, std::allocatorat::Tensor >, std::allocator<std::vector<at::Tensor, std::allocatorat::Tensor > > > const&, ScaleFunctor<float, float>, float) + 0x2866 (0x7f2eee519886 in /fast-2/adrien/virtualenv_p3/lib/python3.5/site-packages/amp_C.cpython-35m-x86_64-linux-gnu.so)
frame #3: multi_tensor_scale_cuda(int, at::Tensor, std::vector<std::vector<at::Tensor, std::allocatorat::Tensor >, std::allocator<std::vector<at::Tensor, std::allocatorat::Tensor > > >, float) + 0x431 (0x7f2eee514e81 in /fast-2/adrien/virtualenv_p3/lib/python3.5/site-packages/amp_C.cpython-35m-x86_64-linux-gnu.so)
frame #4: + 0x172ac (0x7f2eee5132ac in /fast-2/adrien/virtualenv_p3/lib/python3.5/site-packages/amp_C.cpython-35m-x86_64-linux-gnu.so)
frame #5: + 0x1738e (0x7f2eee51338e in /fast-2/adrien/virtualenv_p3/lib/python3.5/site-packages/amp_C.cpython-35m-x86_64-linux-gnu.so)
frame #6: + 0x138b7 (0x7f2eee50f8b7 in /fast-2/adrien/virtualenv_p3/lib/python3.5/site-packages/amp_C.cpython-35m-x86_64-linux-gnu.so)

frame #38: __libc_start_main + 0xf1 (0x7f2f775422e1 in /lib/x86_64-linux-gnu/libc.so.6)

@adrienchaton
Author

About numerical instability vs. a bug: I am not sure, since the pytorch mnist example runs on every opt_level, but here I tried running my code for 10 epochs in FP32 and then initializing amp, and the backward on the scaled_loss directly gives the CUDA error.

Here it optimizes correctly (gradient descent) in FP32
*** APEX_O1 - EPOCH #8 out of 20 with SC,logSC,wave,reg == 0.0 0.0 0.1 1.0 and SC_mode= MSE ; current itr= 161
averaged training losses [0. 0. 0.02693096 0.00047495]
averaged testing losses [0. 0. 0.02001604 0.00160885]
tr/val_loss and lr_updated [0.02693096 0.02001604 0.0002 0. ]
elapsed time = 0:01:05

*** APEX_O1 - EPOCH #9 out of 20 with SC,logSC,wave,reg == 0.0 0.0 0.1 1.0 and SC_mode= MSE ; current itr= 184
averaged training losses [0. 0. 0.0266892 0.00049743]
averaged testing losses [0. 0. 0.01977739 0.00063139]
tr/val_loss and lr_updated [0.0266892 0.01977739 0.0002 0. ]
elapsed time = 0:01:13

*** APEX_O1 - EPOCH #10 out of 20 with SC,logSC,wave,reg == 0.0 0.0 0.1 1.0 and SC_mode= MSE ; current itr= 207
averaged training losses [0. 0. 0.02648677 0.000482 ]
averaged testing losses [0. 0. 0.01960668 0.00081408]
tr/val_loss and lr_updated [0.02648677 0.01960668 0.0002 0. ]
elapsed time = 0:01:20

*** APEX_O1 - EPOCH #11 out of 20 with SC,logSC,wave,reg == 0.0 0.0 0.1 1.0 and SC_mode= MSE ; current itr= 230

Here, at the start of epoch #11, I switch to mixed precision and get the error on the first iteration of epoch #11:

Selected optimization level O1: Insert automatic casts around Pytorch functions and Tensor methods.

Defaults for this optimization level are:
opt_level : O1
loss_scale : dynamic
enabled : True
cast_model_type : None
patch_torch_functions : True
keep_batchnorm_fp32 : None
master_weights : None
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
opt_level : O1
loss_scale : dynamic
enabled : True
cast_model_type : None
patch_torch_functions : True
keep_batchnorm_fp32 : None
master_weights : None
start iteration 230
Traceback (most recent call last):
File "float16_pptrainbis_WAE_MCNN_01.py", line 575, in
scaled_loss.backward()
File "/usr/lib/python3.5/contextlib.py", line 66, in exit
next(self.gen)
File "/fast-2/adrien/virtualenv_p3/lib/python3.5/site-packages/apex/amp/handle.py", line 131, in scale_loss
should_skip = False if delay_overflow_check else loss_scaler.update_scale()
File "/fast-2/adrien/virtualenv_p3/lib/python3.5/site-packages/apex/amp/scaler.py", line 193, in update_scale
self._has_overflow = self._overflow_buf.item()
RuntimeError: CUDA error: an illegal memory access was encountered

@adrienchaton
Author

adrienchaton commented Jun 27, 2019

It's a bit hard to understand what is wrong and what isn't.

On the same server, slots 0 and 1 are both equipped with a titan v.
Sending the same code to cuda:0 seems to behave differently than sending it to cuda:1.

Slot 0 in the first iteration says
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 32768.0
while slot 1 gives the CUDA illegal memory access error.

(same virtualenv on the same machine)
Note: the above runs of the mnist example were all on GPU 0, which I guess is why they ran on every opt_level.

So, using only slot 0 and running more iterations per epoch, after 10 epochs the timing ends up as:

*** APEX_O0 - EPOCH #10 out of 10 with SC,logSC,wave,reg == 0.0 0.0 0.1 1.0 and SC_mode= MSE ; current itr= 729
averaged training losses [0. 0. 0.07547674 0.00040944]
averaged testing losses [0. 0. 0.06774963 0.00775905]
tr/val_loss and lr_updated [0.07547674 0.06774963 0.0002 0. ]
elapsed time = 0:06:11

*** APEX_O1 - EPOCH #10 out of 10 with SC,logSC,wave,reg == 0.0 0.0 0.1 1.0 and SC_mode= MSE ; current itr= 729
averaged training losses [0. 0. 0.09525494 0.00040272]
averaged testing losses [0. 0. 0.08090863 0.00112818]
tr/val_loss and lr_updated [0.09525494 0.08090863 0.0002 0. ]
elapsed time = 0:07:08

Tomorrow, when the IT service is open again, I will ask about nvprof.
It is at nvprof: /usr/share/man/man1/nvprof.1
but the command cannot be found, so it's probably not yet configured for me to use.
