
[Add] test atleast_xd pir backward #59365

Merged
4 commits merged into PaddlePaddle:develop on Dec 5, 2023

Conversation

megemini
Contributor

PR types

Others

PR changes

Others

Description

This PR mainly involves two parts:

At the moment, both reshape and concat seem to have problems.

Test environment below, on AI Studio:

Python 3.10.10 (main, Mar 21 2023, 18:45:11) [GCC 11.2.0]
Type 'copyright', 'credits' or 'license' for more information
IPython 8.15.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import paddle

In [2]: paddle.version.commit
Out[2]: '907e42525e00d7dc3257672399d36aabbdaf8114'
  • The reshape issue

Test example:

import paddle
from paddle.pir_utils import test_with_pir_api

@test_with_pir_api
def test_static_api():
    paddle.enable_static()

    with paddle.static.program_guard(paddle.static.Program()):
        x = paddle.static.data('x', (), 'float64')
        # This should not affect the gradient backpropagation needed later
        # Under PIR, this must be set to False, otherwise an error occurs
        # Under the old static graph there is no problem
        # x.stop_gradient = False

        feed = {'x': 123.0}

        out = paddle.reshape(x, (1,))

        out.stop_gradient = False

        z = out * 123

        fetch_list = [out]
        if paddle.framework.in_pir_mode():
            grads = paddle.autograd.ir_backward.grad(z, out)
            out_grad = grads[0]
            fetch_list.append(out_grad)
        else:
            paddle.static.append_backward(z)
            out_grad = out.grad_name
            fetch_list.append(out_grad)

        exe = paddle.static.Executor()
        *res, res_grad = exe.run(feed=feed, fetch_list=fetch_list)

        print(res, res_grad)

if __name__ == '__main__':
    test_static_api()

Running it fails with:

Traceback (most recent call last):
  File "/home/aistudio/test_atleast/test_refresh_stopgradient.py", line 39, in <module>
    test_static_api()
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.10/site-packages/paddle/pir_utils.py", line 115, in impl
    func(*args, **kwargs)
  File "/home/aistudio/test_atleast/test_refresh_stopgradient.py", line 34, in test_static_api
    *res, res_grad = exe.run(feed=feed, fetch_list=fetch_list)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.10/site-packages/paddle/base/executor.py", line 1715, in run
    res = self._run_pir_impl(
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.10/site-packages/paddle/base/executor.py", line 1993, in _run_pir_impl
    fetch_list = self._check_fetch_list(fetch_list)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.10/site-packages/paddle/base/executor.py", line 2096, in _check_fetch_list
    raise TypeError(
TypeError: Require fetch_list[1] 's type shall be one of (OpResult, str), but received NoneType.

Preliminary debugging points to block.refresh_stopgradient() in paddle/autograd/ir_backward.py:

> /opt/conda/envs/python35-paddle120-env/lib/python3.10/site-packages/paddle/autograd/ir_backward.py(672)calc_gradient_helper()
-> block.refresh_stopgradient()
(Pdb) l
667  
668     def calc_gradient_helper(outputs, inputs, grad_outputs, no_grad_set):
669         block = outputs[0].get_defining_op().get_parent_block()
670  
671         pdb.set_trace()
672  ->     block.refresh_stopgradient()
673  
674         state = State(block.program)
675  
676         # check all inputs and outputs in the same block
677         check_all_puts(block, inputs, outputs)
(Pdb) p inputs
[<paddle.base.libpaddle.pir.OpResult object at 0x7f20ace72130>]
(Pdb) p inputs[0].stop_gradient
False
(Pdb) p outputs
[<paddle.base.libpaddle.pir.OpResult object at 0x7f20aced8930>]
(Pdb) p outputs[0].stop_gradient
False
(Pdb) a
outputs = [<paddle.base.libpaddle.pir.OpResult object at 0x7f20aced8930>]
inputs = [<paddle.base.libpaddle.pir.OpResult object at 0x7f20ace72130>]
grad_outputs = []
no_grad_set = set()
(Pdb) s
> /opt/conda/envs/python35-paddle120-env/lib/python3.10/site-packages/paddle/autograd/ir_backward.py(674)calc_gradient_helper()
-> state = State(block.program)
(Pdb) p inputs[0].stop_gradient
True
(Pdb) p outputs[0].stop_gradient
True
(Pdb) a
outputs = [<paddle.base.libpaddle.pir.OpResult object at 0x7f20aced8930>]
inputs = [<paddle.base.libpaddle.pir.OpResult object at 0x7f20ace72130>]
grad_outputs = []
no_grad_set = set()
(Pdb) 

As the trace shows, after block.refresh_stopgradient() runs, values whose stop_gradient was originally False are flipped to True, so no gradient can be obtained afterwards.

If the commented-out line # x.stop_gradient = False in the example above is uncommented, the script runs fine, but this is inconsistent with the behavior of both the old static graph and the dynamic graph.

According to https://www.paddlepaddle.org.cn/documentation/docs/zh/develop/faq/train_cn.html#stop-gradient-true , whether gradients propagate back to x should not affect the execution that follows (see the dynamic-graph sketch below).
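
For comparison, this is the behavior I would expect, written as a minimal dynamic-graph sketch (an illustration only; it assumes eager mode allows toggling out.stop_gradient after the op, which is how I normally write it):

import paddle

# dynamic graph (eager) mode; x keeps the default stop_gradient=True,
# mirroring paddle.static.data in the static-graph example above
x = paddle.to_tensor(123.0, dtype='float64')

out = paddle.reshape(x, (1,))
out.stop_gradient = False   # gradients should still flow from z back to out

z = out * 123

# gradient of z with respect to out; x stopping gradients should not matter here
(out_grad,) = paddle.grad(z, out)
print(out_grad)  # expected: Tensor([123.])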

  • The concat issue

Test example:

import pdb
import numpy as np
import paddle
from paddle.pir_utils import test_with_pir_api

@test_with_pir_api
def test_static_api():
    paddle.enable_static()

    with paddle.static.program_guard(paddle.static.Program()):
        x = paddle.static.data('x', (1, 2), 'float64')
        y = paddle.static.data('y', (1, 2), 'float64')

        # This should not affect the gradient backpropagation needed later
        # Under PIR, these must be set to False, otherwise an error occurs
        # Under the old static graph there is no problem
        x.stop_gradient = False
        y.stop_gradient = False

        feed = {'x': np.random.rand(1, 2), 'y': np.random.rand(1, 2)}

        pdb.set_trace()
        out = paddle.concat((x, y), axis=0)

        out.stop_gradient = False

        y = out * 123

        fetch_list = [out]
        if paddle.framework.in_pir_mode():
            grads = paddle.autograd.ir_backward.grad(y, out)
            out_grad = grads[0]
            fetch_list.append(out_grad)
        else:
            paddle.static.append_backward(y)
            out_grad = out.grad_name
            fetch_list.append(out_grad)

        exe = paddle.static.Executor()

        pdb.set_trace()
        *res, res_grad = exe.run(feed=feed, fetch_list=fetch_list)

        print(res, res[0].shape, res_grad)

if __name__ == '__main__':
    test_static_api()

Running it fails with:

Traceback (most recent call last):
  File "/home/aistudio/test_stack_extension/test_concat_pir.py", line 47, in <module>
    test_static_api()
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.10/site-packages/paddle/pir_utils.py", line 113, in impl
    func(*args, **kwargs)
  File "/home/aistudio/test_stack_extension/test_concat_pir.py", line 42, in test_static_api
    *res, res_grad = exe.run(feed=feed, fetch_list=fetch_list)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.10/site-packages/paddle/base/executor.py", line 1725, in run
    res = self._run_impl(
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.10/site-packages/paddle/base/executor.py", line 1932, in _run_impl
    ret = new_exe.run(
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.10/site-packages/paddle/base/executor.py", line 825, in run
    tensors = self._new_exe.run(
ValueError: In user code:

    File "/home/aistudio/test_stack_extension/test_concat_pir.py", line 47, in <module>
      test_static_api()
    File "/opt/conda/envs/python35-paddle120-env/lib/python3.10/site-packages/paddle/pir_utils.py", line 113, in impl
      func(*args, **kwargs)
    File "/home/aistudio/test_stack_extension/test_concat_pir.py", line 23, in test_static_api
      out = paddle.concat((x, y), axis=0)
    File "/opt/conda/envs/python35-paddle120-env/lib/python3.10/site-packages/paddle/tensor/manipulation.py", line 1311, in concat
      helper.append_op(
    File "/opt/conda/envs/python35-paddle120-env/lib/python3.10/site-packages/paddle/base/layer_helper.py", line 44, in append_op
      return self.main_program.current_block().append_op(*args, **kwargs)
    File "/opt/conda/envs/python35-paddle120-env/lib/python3.10/site-packages/paddle/base/framework.py", line 4451, in append_op
      op = Operator(
    File "/opt/conda/envs/python35-paddle120-env/lib/python3.10/site-packages/paddle/base/framework.py", line 2999, in __init__
      for frame in traceback.extract_stack():

    InvalidArgumentError: Source and destination tensor should have the same dimension size, but source tensor dimension size is 1, destination tensor size is 2.
      [Hint: Expected src_stride_numel.size() == dst_stride_numel.size(), but received src_stride_numel.size():1 != dst_stride_numel.size():2.] (at /paddle/paddle/phi/kernels/funcs/strided_memcpy.h:105)
      [operator < concat_grad > error]

Preliminary debugging points to self._new_exe.run in paddle/base/executor.py:

> /opt/conda/envs/python35-paddle120-env/lib/python3.10/site-packages/paddle/base/executor.py(825)run()
-> tensors = self._new_exe.run(
(Pdb) l
820                     (the Tensor specified in the fetch list) to numpy.ndarray. if it is False,
821                     the type of the return value is a list of :code:`LoDTensor`. The default is True.
822             """
823             import pdb
824             pdb.set_trace()
825  ->         tensors = self._new_exe.run(
826                 feed_names, enable_job_schedule_profiler
827             )._move_to_list()
828             if return_numpy:
829                 tensors = as_numpy(tensors, copy=True)
830                 if not get_flags("FLAGS_enable_pir_in_executor")[
(Pdb) p feed_names
['x', 'y']
(Pdb) p enable_job_schedule_profiler
False
(Pdb) s
> /opt/conda/envs/python35-paddle120-env/lib/python3.10/site-packages/paddle/base/executor.py(826)run()
-> feed_names, enable_job_schedule_profiler
(Pdb) s
> /opt/conda/envs/python35-paddle120-env/lib/python3.10/site-packages/paddle/base/executor.py(825)run()
-> tensors = self._new_exe.run(
(Pdb) s
I1125 15:42:22.781333 17598 program_interpreter.cc:202] New Executor is Running.
I1125 15:42:22.782413 17598 interpreter_util.cc:612] Standalone Executor is Used.
ValueError: In user code:

    File "/home/aistudio/test_stack_extension/test_concat_pir.py", line 82, in <module>
      test_static_api()
    File "/home/aistudio/test_stack_extension/test_concat_pir.py", line 60, in test_static_api
      out = vstack((x, y))
    File "/home/aistudio/test_stack_extension/test_concat_pir.py", line 11, in vstack
      return paddle.concat(arrays, axis=0, name=name)
    File "/opt/conda/envs/python35-paddle120-env/lib/python3.10/site-packages/paddle/tensor/manipulation.py", line 1311, in concat
      helper.append_op(
    File "/opt/conda/envs/python35-paddle120-env/lib/python3.10/site-packages/paddle/base/layer_helper.py", line 44, in append_op
      return self.main_program.current_block().append_op(*args, **kwargs)
    File "/opt/conda/envs/python35-paddle120-env/lib/python3.10/site-packages/paddle/base/framework.py", line 4451, in append_op
      op = Operator(
    File "/opt/conda/envs/python35-paddle120-env/lib/python3.10/site-packages/paddle/base/framework.py", line 2999, in __init__
      for frame in traceback.extract_stack():

    InvalidArgumentError: Source and destination tensor should have the same dimension size, but source tensor dimension size is 1, destination tensor size is 2.
      [Hint: Expected src_stride_numel.size() == dst_stride_numel.size(), but received src_stride_numel.size():1 != dst_stride_numel.size():2.] (at /paddle/paddle/phi/kernels/funcs/strided_memcpy.h:105)
      [operator < concat_grad > error]
> /opt/conda/envs/python35-paddle120-env/lib/python3.10/site-packages/paddle/base/executor.py(825)run()
-> tensors = self._new_exe.run(

In this test example, stop_gradient is set to False for both x and y, otherwise the reshape issue above reappears (no gradient backpropagation); but even with that setting, the error above still occurs.

The old static graph also has problems with this example; it only runs normally in one case:

  • @test_with_pir_api is commented out, i.e. only the old static graph is run
  • stop_gradient of x and y must be True, otherwise it also errors out.

In summary, static-graph gradient backpropagation for reshape and concat, which the APIs in my three PRs depend on, is currently broken (split also seemed to have problems in earlier tests, but that appears to have been fixed in the last couple of days?).

Could you please take a look? Many thanks!

@luotao1


paddle-bot bot commented Nov 25, 2023

Your PR has been submitted. Thanks for your contribution!
Please wait for the CI results first. See the Paddle CI Manual for details.

@xiaoguoguo626807
Contributor

Regarding the refresh gradient issue: that call was added to fix the problem that stop_gradient set by users after building the network did not take effect; we suggest setting it before building the network in your test. The concat backward error is probably related to stride. Is propagating gradients from the middle of the network really necessary?

@megemini
Contributor Author

Regarding the refresh gradient issue: that call was added to fix the problem that stop_gradient set by users after building the network did not take effect; we suggest setting it before building the network in your test.

How should the test case be written in that case?

The concat backward error is probably related to stride. Is propagating gradients from the middle of the network really necessary?

The logic of the two test programs is basically the same: x (or x and y) -> out = reshape/concat(x, or x and y) -> z = out * 123, and then gradient backpropagation starting from z is tested.

In the second test case, y = out * 123 can be changed to z = out * 123, with z used afterwards as well; it has nothing to do with the earlier inputs x/y, it is just a variable name.

import pdb
import numpy as np
import paddle
from paddle.pir_utils import test_with_pir_api

@test_with_pir_api
def test_static_api():
    paddle.enable_static()

    with paddle.static.program_guard(paddle.static.Program()):
        x = paddle.static.data('x', (1, 2), 'float64')
        y = paddle.static.data('y', (1, 2), 'float64')

        # This should not affect the gradient backpropagation needed later
        # Under PIR, these must be set to False, otherwise an error occurs
        # Under the old static graph there is no problem
        x.stop_gradient = False
        y.stop_gradient = False

        feed = {'x': np.random.rand(1, 2), 'y': np.random.rand(1, 2)}

        # pdb.set_trace()
        out = paddle.concat((x, y), axis=0)

        out.stop_gradient = False

        z = out * 123

        fetch_list = [out]
        if paddle.framework.in_pir_mode():
            grads = paddle.autograd.ir_backward.grad(z, out)
            out_grad = grads[0]
            fetch_list.append(out_grad)
        else:
            paddle.static.append_backward(z)
            out_grad = out.grad_name
            fetch_list.append(out_grad)

        exe = paddle.static.Executor()

        # pdb.set_trace()
        *res, res_grad = exe.run(feed=feed, fetch_list=fetch_list)

        print(res, res[0].shape, res_grad)

if __name__ == '__main__':
    test_static_api()

This also errors out, and it is not backpropagating from the middle of the network; sorry for the earlier slip.

@xiaoguoguo626807
Contributor

If there is a real need to compute only the gradient of z with respect to out in an x, y -> out -> z network, we can remove the refresh operation.
As for the concat backward error when differentiating with respect to both x and y at the same time, we are looking into it internally.

@megemini
Contributor Author

If there is a real need to compute only the gradient of z with respect to out in an x, y -> out -> z network, we can remove the refresh operation.

I think it would be best to stay consistent with the old static graph so as to keep compatibility. The official documentation also says, "If a layer in the static graph uses stop_gradient=True, then all layers before it automatically get stop_gradient=True and gradients are no longer propagated back." That seems like the more reasonable behavior.

As for the concat backward error when differentiating with respect to both x and y at the same time, we are looking into it internally.

Great! :)

@luotao1 could you take a look:

Also, should this PR stay open until the issues are resolved?

Thanks!

@changeyoung98
Contributor

Regarding the stop_gradient refresh issue: the related logic in backward has been removed; going forward, when building PIR networks you need to mark stop_gradient as False at the stage where the data is created. The PR is still in CI and waiting to be merged. #59579
I reproduced the concat error locally and found that the problem is in concat_grad under the old IR; the old IR is no longer maintained, and there is no error under PIR. Your network-building code has some issues that cause concat to be pruned away, so its backward op is never added to the network. You can refer to the code below, which builds and runs correctly under PIR.
[screenshot of the suggested example; a reconstruction follows]
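
The screenshot itself is not preserved in this export. Judging from the follow-up discussion (stop_gradient set when the data is created, and the gradient taken as grad(z, x)), the suggested pattern presumably looked roughly like this (a reconstruction, not the exact screenshot):

import numpy as np
import paddle
from paddle.pir_utils import test_with_pir_api

@test_with_pir_api
def test_static_api():
    paddle.enable_static()

    with paddle.static.program_guard(paddle.static.Program()):
        # mark stop_gradient at creation time, before any op is added
        x = paddle.static.data('x', (1, 2), 'float64')
        x.stop_gradient = False
        y = paddle.static.data('y', (1, 2), 'float64')
        y.stop_gradient = False

        out = paddle.concat((x, y), axis=0)
        z = out * 123

        fetch_list = [out]
        if paddle.framework.in_pir_mode():
            # differentiate z with respect to the network input x, so concat
            # and its backward op are kept in the pruned program
            grads = paddle.autograd.ir_backward.grad(z, x)
            fetch_list.append(grads[0])

        exe = paddle.static.Executor()
        feed = {'x': np.random.rand(1, 2), 'y': np.random.rand(1, 2)}
        res = exe.run(feed=feed, fetch_list=fetch_list)
        print(res)

if __name__ == '__main__':
    test_static_api()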

@megemini
Contributor Author

megemini commented Dec 1, 2023

@xiaoguoguo626807 @changeyoung98 Many thanks to both of you for helping locate the problems! 👍👍👍

I normally write applications with the dynamic graph, so I am still unfamiliar with many things in the static graph; please bear with me. 🙏🙏🙏

A few more small questions; if you have time, could you please take a look:

Regarding the stop_gradient refresh issue: the related logic in backward has been removed; going forward, when building PIR networks you need to mark stop_gradient as False at the stage where the data is created. The PR is still in CI and waiting to be merged. #59579

Given the following example code:

    with paddle.static.program_guard(paddle.static.Program()):
        x = paddle.static.data('x', (), 'float64')

        # A : before the reshape

        feed = {'x': 123.0}
        out = paddle.reshape(x, (1,))

        # B : after the reshape

        z = out * 123

        # C : after the multiplication

        fetch_list = [out]
        if paddle.framework.in_pir_mode():
            grads = paddle.autograd.ir_backward.grad(z, out)
            out_grad = grads[0]
            fetch_list.append(out_grad)
  • Where should stop_gradient be set for x?
  • Where should stop_gradient be set for out?
  • Does stop_gradient need to be set for z at all?
  • Which part counts as the creation stage?
  • Which part counts as the network-building stage?
  • Do the stop_gradient values within the network need to be all True or all False?

I reproduced the concat error locally and found that the problem is in concat_grad under the old IR; the old IR is no longer maintained, and there is no error under PIR.

OK! Then I will skip the old IR in my tests. 🤘🤘🤘

Your network-building code has some issues that cause concat to be pruned away, so its backward op is never added to the network. You can refer to the code below, which builds and runs correctly under PIR. [screenshot]

Here x and y are the inputs, and the whole network is: x, y -> out -> z (or c).

Why is the gradient computed as grads = paddle.autograd.ir_backward.grad(z, x) rather than grads = paddle.autograd.ir_backward.grad(z, out)? Why skip out? If only the gradient of the out -> z part is needed, how should it be written?

That is how I usually write it with the dynamic graph, so I am a bit confused; forgive the naive question. 😂😂😂

P.S. If it is just because the old IR is broken, then there is no issue. Under PIR, is it fine to write grads = paddle.autograd.ir_backward.grad(z, out)?

Thank you very much!!! 🤗🤗🤗


paddle-ci-bot bot commented Dec 3, 2023

Sorry to inform you that 36cf6e9's CIs have passed for more than 7 days. To prevent PR conflicts, you need to re-run all CIs manually.

@changeyoung98
Contributor

grads = paddle.autograd.ir_backward.grad(z, out)

Yes, you can write it that way. Earlier I saw that concat had been pruned from the program and it looked like a problem; I did not notice that you only wanted the gradient of the later part. 😂

As for where to set stop_gradient: in short, just set it right after the variable that needs it has been created. For example,

x = paddle.static.data('x', (1, 2), 'float64')

This line means that x is created by calling the .data API and added to the current program; it is both creation and network building.
If you need x's stop_gradient to be False, just set it when x is created; the stop_gradient value is then propagated downstream, so out and z get the same value as x. If you want to change the value partway through, set it right after out or z is computed; the values do not all need to be True or all False.
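
Put as code, the rule above might look like this (a minimal sketch based on this explanation; the propagation behavior is as described in the comment, not independently verified):

import paddle

paddle.enable_static()
with paddle.static.program_guard(paddle.static.Program()):
    # creation == network building: x enters the current program here
    x = paddle.static.data('x', (1, 2), 'float64')
    x.stop_gradient = False   # set right after creation; the value propagates

    out = paddle.reshape(x, (2, 1))   # out inherits stop_gradient=False from x
    z = out * 123                     # z inherits it as well

    # to change the value midway, set it right after the variable is computed,
    # e.g. out.stop_gradient = True would stop gradients past this point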

@megemini
Contributor Author

megemini commented Dec 5, 2023

@luotao1

Regarding why the unit test does not check gradient backpropagation under the old IR:

# not check old ir
if paddle.framework.in_pir_mode():
    # convert grad value to bool if dtype is bool
    grad_value = 123.0 if dtypes[0] != 'bool' else True
    np.testing.assert_allclose(
        res_grad, np.ones_like(y) * grad_value
    )

This is because gradient backpropagation for unsqueeze appears to be broken under the old IR. Take the following test case:

import numpy as np
import paddle
from paddle.pir_utils import test_with_pir_api

@test_with_pir_api
def test_static_api():
    paddle.enable_static()

    with paddle.static.program_guard(paddle.static.Program()):
        x = paddle.static.data('x', (2, 3), 'float64')
        x.stop_gradient = False

        feed = {'x': np.random.rand(2, 3).astype('float64')}

        out = paddle.unsqueeze(x, [0, 2])
        out.stop_gradient = False

        z = out * 123

        fetch_list = [out]
        if paddle.framework.in_pir_mode():
            grads = paddle.autograd.ir_backward.grad(z, x)
            out_grad = grads[0]
            fetch_list.append(out_grad)
        else:
            paddle.static.append_backward(z)
            out_grad = x.grad_name
            fetch_list.append(out_grad)

        exe = paddle.static.Executor()
        *res, res_grad = exe.run(feed=feed, fetch_list=fetch_list)

        print(res, res_grad)

if __name__ == '__main__':
    test_static_api()

Running it produces:

aistudio@jupyter-942478-6602454:~/test_atleast$ python test_us_pir.py 
I1205 12:20:02.233194   679 program_interpreter.cc:212] New Executor is Running.
I1205 12:20:02.233871   679 interpreter_util.cc:624] Standalone Executor is Used.
[array([[[[0.67414195, 0.07585859, 0.11674783]],

        [[0.39899884, 0.18764049, 0.25199186]]]])] [[1.23000000e+002 0.00000000e+000 0.00000000e+000]
 [3.48520586e-316 3.45845952e-323 1.81776662e-306]]
I1205 12:20:02.243114   679 inplace_pass.cc:473] Apply inplace pass on lowering ::pir::Program to Kernel Dialect.
I1205 12:20:02.243461   679 pir_interpreter.cc:1193] New Executor is Running ...
I1205 12:20:02.243981   679 pir_interpreter.cc:1220] pir interpreter is running by multi-thread mode ...
[array([[[[0.31170326, 0.64944201, 0.60941779]],

        [[0.62743311, 0.9468603 , 0.31616582]]]])] [[123. 123. 123.]
 [123. 123. 123.]]

As shown, @test_with_pir_api runs the wrapped function twice: once with the old IR and once with PIR.

The second run, under PIR, matches expectations; the gradient is

[[123. 123. 123.]
 [123. 123. 123.]]

whereas the gradient under the old IR is wrong:

[[1.23000000e+002 0.00000000e+000 0.00000000e+000]
 [3.48520586e-316 3.45845952e-323 1.81776662e-306]]

As mentioned above, the old IR is no longer maintained, so the unit test only checks gradient backpropagation under PIR.
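
For reference, the expected value can be cross-checked in the dynamic graph (a minimal sketch; since unsqueeze only reshapes, the gradient of z = out * 123 with respect to x is 123 everywhere):

import numpy as np
import paddle

# dynamic graph (eager) mode
x = paddle.to_tensor(np.random.rand(2, 3), dtype='float64', stop_gradient=False)
out = paddle.unsqueeze(x, [0, 2])
z = out * 123

z.sum().backward()
print(x.grad)  # expected: a (2, 3) tensor filled with 123.0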

Please review. Thanks!

Contributor

@changeyoung98 changeyoung98 left a comment


LGTM

@luotao1 luotao1 merged commit a353d9a into PaddlePaddle:develop Dec 5, 2023
29 checks passed