09 Oct 01:08

panyx0718

627bea4

PaddlePaddle 1.0.0

Release Log

Major New Features and Improvements:

Support macOS training and inference, and Windows inference (Alpha).
Speed up the While operator.
Enhance support for sparse tensors.
Enhance TensorRT integration.
More fused operators for CPU inference: GRU, LSTM, etc.
Improvements to sequence operators (sequence_pool, sequence_concat, sequence_mask, sequence_enumerate, sequence_slice, etc.).
Other operator improvements: stack_op, BatchAUC, prelu, crf, pad2d.
decayed_adagrad support for distributed training.
Python multi-process reader.
API doc improvements. Avoid kwargs.
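
For readers unfamiliar with decayed_adagrad, the update rule it is usually described by can be sketched in a few lines. This is a minimal single-value illustration, not the PaddlePaddle kernel and not its distributed implementation; the function name, argument names, and default values are assumptions made for the example.

```python
import math


def decayed_adagrad_step(param, grad, moment, lr=0.01, decay=0.95, epsilon=1e-6):
    """One decayed Adagrad update for a single scalar parameter (illustrative only).

    The squared-gradient accumulator decays exponentially instead of growing
    without bound as in plain Adagrad.
    """
    new_moment = decay * moment + (1.0 - decay) * grad * grad
    new_param = param - lr * grad / (math.sqrt(new_moment) + epsilon)
    return new_param, new_moment


# Example step starting from an empty accumulator.
p, m = decayed_adagrad_step(1.0, 0.5, 0.0, lr=0.1, decay=0.9, epsilon=0.0)
```

Because the accumulator decays, the effective step size does not shrink monotonically toward zero over a long training run.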

Others:

Tighten public APIs. Hide public APIs that are currently not widely used and unlikely to be used in the near future.
Clean up some deprecated features.

Known Issues

Memory optimization still has room for improvement in the next release.
Using memory optimization with distributed training requires strictly following some counter-intuitive instructions.
Sparse Tensor (SelectedRows) is not handled correctly in some operators; this is being fixed in the next release.

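Release Log (translated from Chinese)

Major New Features and Improvements

Support macOS training and inference, and Windows inference (internal testing).
Speed up the while operator.
Enhance support for sparse tensors.
Enhance TensorRT integration.
More fused operators for CPU inference: GRU, LSTM, etc.
Improve sequence operators (sequence_pool, sequence_concat, sequence_mask, sequence_enumerate, sequence_slice, etc.).
Other operator improvements: stack_op, BatchAUC, prelu, crf, pad2d.
decayed_adagrad supports distributed training.
Python multi-process reader.
API documentation improvements, avoiding issues such as kwargs.

Others:

Tighten management of the public API. Some APIs that are not commonly used now and are unlikely to be used in the future have been hidden.
Clean up some deprecated features.

Known Issues

Memory optimization still has some room for improvement in the next release.
Using memory optimization together with distributed training requires strictly following some counter-intuitive steps.
Sparse Tensor (SelectedRows) is not handled correctly in some operators; this will be fixed in the next release.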

25 Sep 10:45

panyx0718

v1.0.0-rc0

644bad1

PaddlePaddle 1.0.0-rc0 Pre-release

Release Log

Major New Features and Improvements:

Support macOS training and inference, and Windows inference (Alpha).
Speed up the While operator.
Enhance support for sparse tensors.
Enhance TensorRT integration.
More fused operators for CPU inference: GRU, LSTM, etc.
Improvements to sequence operators (sequence_pool, sequence_concat, sequence_mask, sequence_enumerate, sequence_slice, etc.).
Other operator improvements: stack_op, BatchAUC, prelu, crf, pad2d.
decayed_adagrad support for distributed training.
Python multi-process reader.
API doc improvements. Avoid kwargs.

Others:

Tighten public APIs. Hide public APIs that are currently not widely used and unlikely to be used in the near future.
Clean up some deprecated features.

Known Issues

Memory optimization still has room for improvement in the next release.
Using memory optimization with distributed training requires strictly following some counter-intuitive instructions.

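Release Log (translated from Chinese)

Major New Features and Improvements

Support macOS training and inference, and Windows inference (internal testing).
Speed up the while operator.
Enhance support for sparse tensors.
Enhance TensorRT integration.
More fused operators for CPU inference: GRU, LSTM, etc.
Improve sequence operators (sequence_pool, sequence_concat, sequence_mask, sequence_enumerate, sequence_slice, etc.).
Other operator improvements: stack_op, BatchAUC, prelu, crf, pad2d.
decayed_adagrad supports distributed training.
Python multi-process reader.
API documentation improvements, avoiding issues such as kwargs.

Others:

Tighten management of the public API. Some APIs that are not commonly used now and are unlikely to be used in the future have been hidden.
Clean up some deprecated features.

Known Issues

Memory optimization still has some room for improvement in the next release.
Using memory optimization together with distributed training requires strictly following some counter-intuitive steps.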

05 Sep 03:46

panyx0718

v0.15.0

1ca241c

PaddlePaddle 0.15.0

Release Log

Major New Features and Improvements:

PyReader. Supports Python-level customized data loading and preprocessing for the buffered reader.
Unified Intermediate Representation (IR) and transforms for single-machine training, distributed training, and inference.
Early Python3 support. (Alpha testing)
Inference library symbol hiding. Better isolation from other libraries linked together.
Distributed lookup table training with parallel executor. Allows distributed training to scale to large sparse datasets. (Alpha testing)
Major stability and test coverage improvements for distributed training.
Polished frequently triggered enforce error messages to improve usability.
Profiler improvements for dist_train, plus fixes.
Operator improvements: mkldnn softmax_grad, squeeze_op, hsigmoid_op, Sampling id, beam_search, flatten_op, rank_loss_op, prior_box_op, bilinear initializer, squeeze/unsqueeze, maxout.
Major expansion of TensorRT inference support.
Continuous Integration and Evaluation service scale and stability improvements.
Hid many public APIs that shouldn't be exposed.

Performance:

layer_norm speedup: forward 0.52ms -> 0.16ms (average), backward 1.08ms -> 0.41ms (average).
conv_grad MKL-DNN speedup; fc and gru CPU improvements.
reduce_sum CPU kernel speedup: 4x.
softmax_with_cross_entropy speedup: forward 52.4ms -> 15.6ms.
OCR CPU model speed improvements. Enhanced im2col; performance of some OCR models on the 2620v3 improved by 34.6%.
Depthwise conv2d_transposed speedup. Face detection model speed improved by 16.5%.

Others:

Added external dependencies: xbyak, cub, libxsmm
Merged libpaddle_inference_api[.a/.so] into libpaddle_fluid[.a/.so]. Inference only needs to link libpaddle_fluid[.a/.so].
Fixes of float16 support
Significantly reduce fluid.tgz package size. GPU version reduced from 730M to 190M. CPU version reduced from 335M to 77M.

Known Issues

Using memory_optimize with distributed training might trigger subtle bugs. We are aiming to fix it in the next release.

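Release Log (translated from Chinese)

Major New Features and Improvements

PyReader. Supports custom data loading and preprocessing in Python, then feeds the data to the buffered reader.
Single-machine training, multi-machine training, and inference all use a unified intermediate representation and transforms.
Python3 support (internal testing).
Better symbol hiding in the inference library, and better isolation from other dependency libraries.
Support for distributed lookup tables, enabling large-scale sparse training (internal testing).
Significant stability and test coverage improvements for distributed training.
Improved readability of error messages.
Profiler support for distributed training, plus fixes.
New operators: mkldnn softmax_grad, squeeze_op, hsigmoid_op, Sampling id, beam_search, flatten_op, rank_loss_op, prior_box_op, bilinear initializer, squeeze/unsqueeze, maxout.
Expanded TensorRT support, covering more TensorRT operators.
Scale and stability improvements to the continuous integration test system.
Hid a large number of public APIs that should not be exposed, making the public API more rigorous.

Performance:

layer_norm speedup: forward 0.52ms -> 0.16ms (average), backward 1.08ms -> 0.41ms (average).
conv_grad MKL-DNN speedup; fc and gru CPU optimizations.
reduce_sum is 4x faster on CPU.
softmax_with_cross_entropy speedup: 52.4ms -> 15.6ms.
OCR CPU model performance improvement: the im2col implementation was improved, making conv more efficient and giving the OCR model a 34.6% performance gain on the 2620v3.
conv2d_transposed_op supports setting Group, and depthwise conv2d_transposed is faster; this speedup makes the face detection model 16.5% faster.

Others:

New third-party libraries: xbyak, cub, libxsmm.
Merged libpaddle_inference_api[.a/.so] into libpaddle_fluid[.a/.so]; inference only needs to link libpaddle_fluid[.a/.so].
float16 fixes.
Greatly reduced the size of the released fluid.tgz package: the GPU version shrank from 730M to 190M and the CPU version from 335M to 77M, speeding up downloads.

Known Issues

memory_optimize can trigger bugs in distributed training; we will fix this in the next release.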

03 Sep 03:23

panyx0718

v0.15.0-rc0

64d48f4

PaddlePaddle 0.15.0-rc0 Pre-release

Release Log

Major New Features and Improvements:

PyReader. Supports Python-level customized data loading and preprocessing for the buffered reader.
Unified Intermediate Representation (IR) and transforms for single-machine training, distributed training, and inference.
Early Python3 support. (Alpha testing)
Inference library symbol hiding. Better isolation from other libraries linked together.
Distributed lookup table training with parallel executor. Allows distributed training to scale to large sparse datasets. (Alpha testing)
Major stability and test coverage improvements for distributed training.
Polished frequently triggered enforce error messages to improve usability.
Profiler improvements for dist_train, plus fixes.
Operator improvements: mkldnn softmax_grad, squeeze_op, hsigmoid_op, Sampling id, beam_search, flatten_op, rank_loss_op, prior_box_op, bilinear initializer, squeeze/unsqueeze, maxout.
Major expansion of TensorRT inference support.
Continuous Integration and Evaluation service scale and stability improvements.
Hid many public APIs that shouldn't be exposed.

Performance:

layer_norm speedup: forward 0.52ms -> 0.16ms (average), backward 1.08ms -> 0.41ms (average).
conv_grad MKL-DNN speedup; fc and gru CPU improvements.
reduce_sum CPU kernel speedup: 4x.
softmax_with_cross_entropy speedup: forward 52.4ms -> 15.6ms.
OCR CPU model speed improvements. Enhanced im2col; performance of some OCR models on the 2620v3 improved by 34.6%.
Depthwise conv2d_transposed speedup. Face detection model speed improved by 16.5%.

Others:

Added external dependencies: xbyak, cub, libxsmm
Merged libpaddle_inference_api[.a/.so] into libpaddle_fluid[.a/.so]. Inference only needs to link libpaddle_fluid[.a/.so].
Fixes of float16 support
Significantly reduce fluid.tgz package size. GPU version reduced from 730M to 190M. CPU version reduced from 335M to 77M.

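Release Log (translated from Chinese)

Major New Features and Improvements

PyReader. Supports custom data loading and preprocessing in Python, then feeds the data to the buffered reader.
Single-machine training, multi-machine training, and inference all use a unified intermediate representation and transforms.
Python3 support (internal testing).
Better symbol hiding in the inference library, and better isolation from other dependency libraries.
Support for distributed lookup tables, enabling large-scale sparse training (internal testing).
Significant stability and test coverage improvements for distributed training.
Improved readability of error messages.
Profiler support for distributed training, plus fixes.
New operators: mkldnn softmax_grad, squeeze_op, hsigmoid_op, Sampling id, beam_search, flatten_op, rank_loss_op, prior_box_op, bilinear initializer, squeeze/unsqueeze, maxout.
Expanded TensorRT support, covering more TensorRT operators.
Scale and stability improvements to the continuous integration test system.
Hid a large number of public APIs that should not be exposed, making the public API more rigorous.

Performance:

layer_norm speedup: forward 0.52ms -> 0.16ms (average), backward 1.08ms -> 0.41ms (average).
conv_grad MKL-DNN speedup; fc and gru CPU optimizations.
reduce_sum is 4x faster on CPU.
softmax_with_cross_entropy speedup: 52.4ms -> 15.6ms.
OCR CPU model performance improvement: the im2col implementation was improved, making conv more efficient and giving the OCR model a 34.6% performance gain on the 2620v3.
conv2d_transposed_op supports setting Group, and depthwise conv2d_transposed is faster; this speedup makes the face detection model 16.5% faster.

Others:

New third-party libraries: xbyak, cub, libxsmm.
Merged libpaddle_inference_api[.a/.so] into libpaddle_fluid[.a/.so]; inference only needs to link libpaddle_fluid[.a/.so].
float16 fixes.
Greatly reduced the size of the released fluid.tgz package: the GPU version shrank from 730M to 190M and the CPU version from 335M to 77M, speeding up downloads.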

03 Jul 08:29

panyx0718

v0.14.0

163b5e5

v0.14.0

Release Log

Major Features

Enhanced the inference library. Better memory buffer. Added several demos.
Inference library added support for Anakin engine, TensorRT engine.
ParallelExecutor supports multi-threaded CPU training. (In addition to multi-GPU training)
Added mean IOU operator, argsort operator, etc. Improved L2norm operator. Added crop API.
Released pre-trained ResNet50, Se-Resnext50, AlexNet, etc. Enhanced Transformer, etc.
New data augmentation operators.
Major documentation and API comment improvements.
Enhance the continuous evaluation system.

Performance Improvements

More overlap of distributed training network operation with computation. ~10% improvements
CPU performance improvements with more MKLDNN support.

Major Bug Fixes

Fix memory leak issues.
Fix concat operator.
Fix ParallelExecutor input data memcpy issue.
Fix ParallelExecutor deadlock issue.
Fix distributed training client timeout.
Fix distributed training pserver side learning rate decay.
Thread-safe Scope implementation.
Fix some issues when using the memory optimizer and ParallelExecutor together.

Known Issues

IfElse has some bugs.
BatchNorm is not stable if batch_size=1

05 Jun 07:34

panyx0718

v0.13.0

9d40eb3

v0.13.0

Release Log

Major Features

Asynchronous distributed training support.
Distributed training with ParallelExecutor.
Distributed ring-based training with NCCL2.
Support checkpoint save on trainer and store on trainer and parameter server.
Graceful shutdown of parameter server.
Publish the high-level inference lib API and inference implementation.
Assign roles to each op.
Publish the C++ train API to allow embedding Fluid in other C++ systems.
Support uint8_t type data file and data exchange.
C++ reader supports customized data augmentation.
Improved operator and interface support for speech models.
New random_crop op.
New shape op to get the tensor's shape.
New resize_bilinear interface.
New dice_loss layer.
Enhanced reduce_op to support reduce on multiple dimensions.

Performance Improvements

On the ResNet-50 model with P40 GPUs, single-GPU speed improves by 23.8% (105 images/sec to 130 images/sec). The speedup ratio is 6 on 8 GPUs and reaches 17.4 on 32 GPUs.
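
The figures above can be checked with a little arithmetic. The sketch below assumes the speedup ratios are measured against single-GPU throughput, and defines scaling efficiency as speedup divided by GPU count; the variable names are made up for the example.

```python
# Single-GPU throughput before and after, in images/sec (from the note above).
single_before, single_after = 105.0, 130.0
single_gain_pct = (single_after - single_before) / single_before * 100

# Reported multi-GPU speedup ratios, keyed by number of GPUs.
speedups = {8: 6.0, 32: 17.4}

# Scaling efficiency: fraction of ideal linear scaling actually achieved.
efficiency = {gpus: ratio / gpus for gpus, ratio in speedups.items()}

print("single-GPU gain: %.1f%%" % single_gain_pct)
for gpus in sorted(efficiency):
    print("%d GPUs: %.1f%% scaling efficiency" % (gpus, efficiency[gpus] * 100))
```

Under that definition, 8 GPUs reach 75% of linear scaling and 32 GPUs a little over half.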

Overlap send/recv op with other operators.
Multi-thread server-side request handling.
Weight decay and clipping moved from trainer to parameter server for performance and correctness.
Improved C++ reader.

Major Bug Fixes

Fix accuracy loss when both ParallelExecutor and memory optimizer are used.
Fix ParallelExecutor hang when multiple inputs are duplicated.
Fix memory leak caused by Program clone.
Fix ineffective bias and wrong activation in the GRU unit.
Fix ROI Pooling GPU computation issues.
Fix fill_constant_batch_size_like when input is sequence.
Fix reshape op.

26 Apr 11:48

panyx0718

v0.12.0

c816121

v0.12.0

Release log

Major Improvements

Reader Prototype. Data can be read through C++ reader asynchronously with potentially higher performance.

ParallelExecutor. Significantly improve the multi-gpu performance over the previous solution.

Distributed Training. Major performance improvements and stability improvements.

Inplace Activation. Significantly reduce the GPU memory requirements and increase the batch size.

Operator Optimizations. Performance improvements of many operators.

Timeline Profiling. Allows visualizing performance as a time series.

Major Bug Fixes

Calling cublas/cudnn library with wrong argument types.

Evaluated Models

Image Classification

Object Detection

OCR

Machine Translation

Text Classification

Language Model

Sequence Tagging

13 Mar 08:07

reyoung

v0.11.1a2

1f757f5

0.11.1a2 Pre-release

This release is a weekly alpha version of PaddlePaddle. It should only be used for internal tests. This is not a production-ready version.

Release log

Performance gain and memory optimization

Config and Env:

model: SE-ResNet-150
Input: 3 x 224 x 224
batch_size: 25
CentOS 6.3, Tesla P40, single card.

The comparison results before optimization:

Framework	Speed	Memory
Fluid (before)	1.95 sec/iter	18341 MB
PyTorch	1.154 sec/iter	13359 MB
Fluid/PyTorch	1.6898	1.3729

After optimizing the speed:

Framework	Speed	Memory
Fluid (opti_speed)	1.45 sec/iter	17222 MB
PyTorch	1.154 sec/iter	13359 MB
Fluid/PyTorch	1.2565	1.2892

After optimizing the memory usage:

Framework	Speed	Memory
Fluid (opti_mem)	1.93 sec/iter	14388 MB
PyTorch	1.154 sec/iter	13359 MB
Fluid/PyTorch	1.6724	1.0770
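
The Fluid/PyTorch rows are ratios of the two measurements above them (lower is better, 1.0 means parity). A small sketch that recomputes them from the reported numbers; the dictionary layout and names are only for illustration.

```python
# (seconds per iteration, memory in MB), taken from the tables above.
pytorch = (1.154, 13359)
fluid = {
    "before": (1.95, 18341),
    "opti_speed": (1.45, 17222),
    "opti_mem": (1.93, 14388),
}

# Ratio of Fluid to PyTorch for speed and memory, rounded to 4 decimals.
ratios = {
    name: (round(sec / pytorch[0], 4), round(mem / pytorch[1], 4))
    for name, (sec, mem) in fluid.items()
}

for name in ("before", "opti_speed", "opti_mem"):
    print(name, ratios[name])
```

The speed optimization mainly closes the time gap, while the memory optimization brings memory use close to parity at roughly the original speed.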

Overall performance gain.
- Details in issue #8990
Free GPU memory during training.
[WIP] Feed data from C++
- Add basic RecordIO API
- Polish C++ Reader operators
- Add DoubleBuffer Reader

Distributed training

Support distributed sparse updates.
[WIP] Send/recv using zero-copy gRPC transfer.

09 Dec 03:11

jacquesqiao

v0.11.0

6332b82

v0.11.0

Fluid

Release v0.11.0 includes a new feature, PaddlePaddle Fluid. Fluid is designed to allow users to program as they would in PyTorch and TensorFlow Eager Execution. In these systems, there is no longer the concept of a model, and applications include neither a symbolic description of a graph of operators nor a sequence of layers. Instead, applications look exactly like a usual program that describes a process of training or inference. The difference between Fluid and PyTorch or Eager Execution is that Fluid doesn't rely on Python's control flow (if-then-else and for). Instead, Fluid provides its own C++ implementations and their Python bindings using the with statement. For example:

https://github.com/PaddlePaddle/Paddle/blob/3df78ed2a98d37f7ae6725894cc7514effd5664b/python/paddle/v2/fluid/tests/test_while_op.py#L36-L44

In v0.11.0, we provide a C++ class, Executor, to run a Fluid program. Executor works like an interpreter. In future versions, we will improve Executor into a debugger like GDB, and we might provide some compilers which, for example, take an application like the one above and output an equivalent C++ source program, which can be compiled using nvcc to generate binaries that use CUDA, or using icc to generate binaries that make full use of Intel CPUs.
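
To make the "Executor works like an interpreter" idea concrete, here is a toy sketch in plain Python. It is not PaddlePaddle code: the op names, the dictionary-based program format, and run_block are all invented for illustration. The point it shows is that the loop is part of the program description and is executed by the interpreter, rather than being a Python loop in user code.

```python
# Toy illustration only: none of these names are PaddlePaddle APIs.

def run_block(block, scope):
    """Interpret a list of op descriptions against a variable scope (a dict)."""
    for op in block:
        kind = op["type"]
        if kind == "fill_constant":
            scope[op["out"]] = op["value"]
        elif kind == "add":
            scope[op["out"]] = scope[op["x"]] + scope[op["y"]]
        elif kind == "less_than":
            scope[op["out"]] = scope[op["x"]] < scope[op["y"]]
        elif kind == "while":
            # The loop lives in the program as data; the interpreter re-runs
            # the sub-block until the condition variable becomes False.
            while scope[op["cond"]]:
                run_block(op["block"], scope)
        else:
            raise ValueError("unknown op type: %s" % kind)
    return scope


# A program that sums 0 + 1 + 2 using a "while" op with a sub-block.
program = [
    {"type": "fill_constant", "out": "i", "value": 0},
    {"type": "fill_constant", "out": "one", "value": 1},
    {"type": "fill_constant", "out": "limit", "value": 3},
    {"type": "fill_constant", "out": "total", "value": 0},
    {"type": "less_than", "x": "i", "y": "limit", "out": "cond"},
    {"type": "while", "cond": "cond", "block": [
        {"type": "add", "x": "total", "y": "i", "out": "total"},
        {"type": "add", "x": "i", "y": "one", "out": "i"},
        {"type": "less_than", "x": "i", "y": "limit", "out": "cond"},
    ]},
]

result = run_block(program, {})
```

Because the whole program, including its control flow, is plain data, it could in principle also be handed to a compiler instead of an interpreter, which is the direction the paragraph above describes.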

New Features

Release Fluid.
Add C-API for model inference
Use fluid API to create a simple GAN demo.
Add a development guide on performance tuning.
Add retry when downloading paddle.v2.dataset.
Link protobuf-lite instead of protobuf in C++ to reduce the binary size.
Release the Elastic Deep Learning (EDL) feature.
New-style CMake functions for Paddle, based on the Bazel API.
Automatically download and compile with the Intel® MKLML library as CBLAS when building with WITH_MKL=ON.
Intel® MKL-DNN on PaddlePaddle:
- Complete 11 MKL-DNN layers: Convolution, Fully Connected, Pooling, ReLU, Tanh, ELU, Softmax, BatchNorm, AddTo, Concat, LRN.
- Complete 3 MKL-DNN networks: VGG-19, ResNet-50, GoogleNet
- Benchmark on Intel Skylake 6148 CPU: 2~3x training speedup compared with MKLML.
Add the softsign activation.
Add the dot product layer.
Add the L2 distance layer.
Add the sub-nested sequence layer.
Add the kmax sequence score layer.
Add the sequence slice layer.
Add the row convolution layer
Add mobile friendly webpages.

Improvements

Build and install using a single whl package.
Custom evaluating in V2 API.
Change PADDLE_ONLY_CPU to PADDLE_WITH_GPU, since we will support many kinds of devices.
Remove buggy BarrierStat.
Clean and remove unused functions in paddle::Parameter.
Remove ProtoDataProvider.
Huber loss supports both regression and classification.
Add the stride parameter for sequence pooling layers.
Enable the v2 API to use cuDNN batch normalization automatically.
The BN layer's parameters can be shared by fixing the parameter name.
Support variable-dimension input feature for 2D convolution operation.
Refine cmake about CUDA to automatically detect GPU architecture.
Improved website navigation.

Bug Fixes

Fix bug in ROI pooling. cc9a761
Fix AUC is zero when label is dense vector. #5274
Fix bug in WarpCTC layer.

10 May 09:26

gangliao

v0.10.0

2c98bec

v0.10.0

Release v0.10.0

Please pull the official images from docker hub.

We are glad to release version 0.10.0. In this version, we are happy to release the new
Python API.

Our old Python API is somewhat out of date. It's hard to learn and hard to
use. To write a PaddlePaddle program using the old API, we'd have to write
at least two Python files: one data provider and another one that defines
the network topology. Users start a PaddlePaddle job by running the
paddle_trainer C++ program, which calls the Python interpreter to run the
network topology configuration script and then starts the training loop,
which iteratively calls the data provider function to load minibatches.
This prevents us from writing a Python program in a modern way, e.g., in the
Jupyter Notebook.
The new API, which we often refer to as the v2 API, allows us to write
much shorter Python programs to define the network and the data in a single
.py file. Also, this program can run in Jupyter Notebook, since the entry
point is in the Python program and PaddlePaddle runs as a shared library loaded
and invoked by this Python program.

Based on the new API, we delivered an online interactive book, Deep Learning 101,
and its Chinese version.

We also worked on updating our online documentation to describe the new API.
This is ongoing work. We will release more documentation improvements
in the next version.

We also worked on bringing the new API to distributed model training (via MPI and
Kubernetes). This work is ongoing. We will release more about it in the next
version.

New Features

Release the new Python API.
Deep Learning 101 book in English and Chinese.
Support rectangle input for CNN.
Support stride pooling for seqlastin and seqfirstin.
Expose seq_concat_layer/seq_reshape_layer in trainer_config_helpers.
Add dataset package: CIFAR, MNIST, IMDB, WMT14, CONLL05, movielens, imikolov.
Add Priorbox layer for Single Shot Multibox Detection.
Add smooth L1 cost.
Add data reader creator and data reader decorator for v2 API.
Add the CPU implementation of cmrnorm projection.
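
The reader creator and reader decorator idea listed above can be sketched in plain Python. This is an illustration of the pattern only, under the assumption that a reader is a no-argument callable returning an iterator over samples; the function names below are hypothetical stand-ins and are not claimed to match the v2 API.

```python
import random


def range_reader_creator(n):
    """Reader creator: builds a reader that yields the samples 0..n-1."""
    def reader():
        for i in range(n):
            yield i
    return reader


def shuffled(reader, buf_size, seed=None):
    """Reader decorator: shuffles samples within buffers of buf_size."""
    def new_reader():
        rng = random.Random(seed)
        buf = []
        for sample in reader():
            buf.append(sample)
            if len(buf) >= buf_size:
                rng.shuffle(buf)
                for item in buf:
                    yield item
                buf = []
        # Flush whatever is left in the last, partially filled buffer.
        rng.shuffle(buf)
        for item in buf:
            yield item
    return new_reader


def batched(reader, batch_size):
    """Reader decorator: groups samples into lists of up to batch_size."""
    def new_reader():
        batch = []
        for sample in reader():
            batch.append(sample)
            if len(batch) == batch_size:
                yield batch
                batch = []
        if batch:
            yield batch
    return new_reader


# Decorators compose: each one wraps a reader and returns a new reader.
train_reader = batched(shuffled(range_reader_creator(10), buf_size=4, seed=0), batch_size=3)
batches = list(train_reader())
```

Since a reader is a callable rather than an iterator, calling train_reader() again starts a fresh pass over the data, which is convenient for multi-epoch training loops.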

Improvements

Support Python virtualenv for paddle_trainer.
Add pre-commit hooks to automatically format our code.
Upgrade protobuf to version 3.x.
Add an option to check data type in Python data provider.
Speedup the backward of average layer on GPU.
Documentation refinement.
Check dead links in documents using Travis-CI.
Add an example explaining sparse_vector.
Add ReLU in layer_math.py
Simplify data processing flow for Quick Start.
Support CUDNN Deconv.
Add data feeder in v2 API.
Support predicting the samples from sys.stdin for sentiment demo.
Provide a multi-process interface for image preprocessing.
Add benchmark document for v1 API.
Add packages for automatically downloading public datasets.
Rename Argument::sumCost to Argument::sum, since class Argument has nothing to do with cost.
Expose Argument::sum to Python
Add a new TensorExpression implementation for matrix-related expression evaluations.
Add lazy assignment for optimizing the calculation of a batch of multiple expressions.
Add abstract class Function and its implementations:
- PadFunc and PadGradFunc.
- ContextProjectionForwardFunc and ContextProjectionBackwardFunc.
- CosSimBackward and CosSimBackwardFunc.
- CrossMapNormalFunc and CrossMapNormalGradFunc.
- MulFunc.
Add classes AutoCompare and FunctionCompare, which make it easier to write unit tests comparing the GPU and CPU versions of a function.
Generate libpaddle_test_main.a and remove the main function inside the test file.
Support dense numpy vector in PyDataProvider2.
Clean code base, remove some copy-n-pasted code snippets:
- Extract RowBuffer class for SparseRowMatrix.
- Clean the interface of GradientMachine.
- Use override keyword in layer.
- Simplify Evaluator::create, use ClassRegister to create Evaluators.
Check MD5 checksum when downloading demo's dataset.
Add paddle::Error, which is intended to replace LOG(FATAL) in Paddle.
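
As an illustration of the MD5 check on downloaded datasets mentioned in the list above, a checksum comparison might look like the following. This is a generic sketch using Python's standard library, not the code used by the demos; the helper names are hypothetical, and the sample file merely stands in for a downloaded archive.

```python
import hashlib
import os
import tempfile


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_download_intact(path, expected_md5):
    """True if the file's MD5 matches the published checksum."""
    return md5_of_file(path) == expected_md5.lower()


# Stand-in for a downloaded file: a temporary file containing b"hello".
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")
    sample_path = tmp.name

intact = is_download_intact(sample_path, "5d41402abc4b2a76b9719d911017c592")
corrupted = is_download_intact(sample_path, "0" * 32)
os.remove(sample_path)
```

A downloader would typically re-fetch the file when the check fails instead of silently using a truncated or corrupted archive.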

Bug Fixes

Check layer input types for recurrent_group.
Don't run clang-format with .cu source files.
Fix bugs with LogActivation.
Fix the bug that runs test_layerHelpers multiple times.
Fix the bug that the seq2seq demo exceeds protobuf message size limit.
Fix the bug in dataprovider converter in GPU mode.
Fix a bug in GatedRecurrentLayer.
Fix bug in BatchNorm when testing more than one model.
Fix broken unit test of paramRelu.
Fix some compile-time warnings about CpuSparseMatrix.
Fix MultiGradientMachine error when trainer_count > batch_size.
Fix bugs that prevent asynchronous data loading in PyDataProvider2.

Releases: PaddlePaddle/Paddle
