
Add Layer Normalization #2213

Merged: 13 commits, Oct 20, 2020

Conversation

@arrufat (Contributor) commented Oct 12, 2020

This is a second step towards implementing transformers in dlib.

This PR adds Layer Normalization support to dlib.

I have added the CPU implementation and the GPU forward pass, but I have some doubts, so I would really appreciate it if somebody could have a look at the CUDA implementation.

It runs slower than the CPU version, and for big inputs (2, 3, 224, 224) the results are wrong, but for small inputs it works.
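
For context, layer normalization standardizes all the values in each sample to zero mean and unit variance, then applies a per-sample affine transform. Here is a minimal reference sketch of that computation (a hypothetical standalone helper operating on flat buffers, not dlib's actual cpu::layer_normalize):

#include <cmath>
#include <cstddef>

// Normalize each sample of a flattened [num_samples x sample_size] buffer
// to zero mean / unit variance, then scale and shift with per-sample
// gamma and beta (sized per sample, matching the test program below).
void layer_norm_reference(
    float* y, const float* x,
    const float* gamma, const float* beta,
    std::size_t num_samples, std::size_t sample_size, float eps)
{
    for (std::size_t n = 0; n < num_samples; ++n)
    {
        const float* xn = x + n * sample_size;
        float* yn = y + n * sample_size;

        float mean = 0;
        for (std::size_t i = 0; i < sample_size; ++i)
            mean += xn[i];
        mean /= sample_size;

        float var = 0;
        for (std::size_t i = 0; i < sample_size; ++i)
            var += (xn[i] - mean) * (xn[i] - mean);
        var /= sample_size;

        const float invstd = 1.0f / std::sqrt(var + eps);
        for (std::size_t i = 0; i < sample_size; ++i)
            yn[i] = gamma[n] * (xn[i] - mean) * invstd + beta[n];
    }
}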

Here's how I tried it, in case anyone wants to reproduce it:

#include <dlib/dnn.h>
#include <dlib/statistics.h> // running_stats

#include <chrono>
#include <iostream>

using namespace dlib;

int main() try
{

    // print the average over all elements of a tensor
    auto print_tensor_and_avg = [](const tensor& t)
    {
        float sum = 0;
        const float* p_t = t.host();
        for (long n = 0; n < t.num_samples(); ++n)
        {
            // std::cout << "sample: " << n << '\n';
            for (long k = 0; k < t.k(); ++k)
            {
                // std::cout << "\tk: " << k << '\n';
                for (long r = 0; r < t.nr(); ++r)
                {
                    for (long c = 0; c < t.nc(); ++c)
                    {
                        // std::cout << '\t' << p_t[tensor_index(t, n, k, r, c)] << ' ';
                        sum += p_t[tensor_index(t, n, k, r, c)];
                    }
                    // std::cout << '\n';
                }
            }
        }
        std::cout << "avg: " << sum / t.size() << '\n';
    };

    // a batch of 2 samples, each with 3x224x224 values
    resizable_tensor x(2, 3, 224, 224);
    resizable_tensor y_cpu(x), y_cuda(x);
    tt::tensor_rand rnd(0);
    rnd.fill_uniform(x);
    // per-sample statistics and per-sample affine parameters
    resizable_tensor means(x.num_samples()), invstds(x.num_samples());
    resizable_tensor gamma(x.num_samples()), beta(x.num_samples());
    gamma = 1; // identity scale
    beta = 0;  // no shift

    print_tensor_and_avg(x);

    const float eps = 1e-5f;
    running_stats<float> rs;
    cpu::layer_normalize(eps, y_cpu, means, invstds, x, gamma, beta);
    print_tensor_and_avg(y_cpu);
    // time 1000 CPU runs
    for (int i = 0; i < 1000; ++i)
    {
        const auto t0 = std::chrono::steady_clock::now();
        cpu::layer_normalize(eps, y_cpu, means, invstds, x, gamma, beta);
        const auto t1 = std::chrono::steady_clock::now();
        rs.add(std::chrono::duration_cast<std::chrono::duration<float, std::micro>>(t1 - t0).count());
    }
    std::cout << "inference time: " << rs.mean() << " ± " << rs.stddev() << " µs\n";

    rs.clear();
    cuda::layer_normalize(eps, y_cuda, means, invstds, x, gamma, beta);
    print_tensor_and_avg(y_cuda);
    // time 1000 CUDA runs
    for (int i = 0; i < 1000; ++i)
    {
        const auto t0 = std::chrono::steady_clock::now();
        cuda::layer_normalize(eps, y_cuda, means, invstds, x, gamma, beta);
        const auto t1 = std::chrono::steady_clock::now();
        rs.add(std::chrono::duration_cast<std::chrono::duration<float, std::micro>>(t1 - t0).count());
    }
    std::cout << "inference time: " << rs.mean() << " ± " << rs.stddev() << " µs\n";

    std::cout << "difference: " << sum(squared(mat(y_cpu) - mat(y_cuda))) << '\n';

    return EXIT_SUCCESS;
}
catch (const std::exception& e)
{
    std::cout << e.what() << '\n';
    return EXIT_FAILURE;
}

With a size of (2, 3, 4, 5), I get this:

// cpu
avg: 7.05322e-08
inference time: 29.3967 ± 0.475304 µs
// cuda
avg: 6.25849e-08
inference time: 49.7305 ± 1.50192 µs
difference: 2.39537e-12

(The near-zero averages are expected: with gamma = 1 and beta = 0, layer normalization gives every sample zero mean, so the average over the whole tensor should be ~0 up to floating-point error.)

But with a size of (2, 3, 224, 224):

// cpu
avg: -4.32392e-06
inference time: 777.195 ± 16.5298 µs
// cuda
avg: -379502
inference time: 4188.05 ± 267.693 µs
difference: nan

Thanks in advance for any help.

@arrufat marked this pull request as draft October 12, 2020 16:32
(Review comment on dlib/cuda/cuda_dlib.cu, resolved)
@arrufat marked this pull request as ready for review October 13, 2020 08:23
@arrufat (Contributor, Author) commented Oct 13, 2020

Ready for review!
With this implementation, these are the results for a (2, 3, 224, 224)-sized tensor:

 cpu forward time: 780.407 ± 12.5129 µs
cuda forward time: 87.3476 ± 2.33944 µs

 cpu backward time: 3706.97 ± 45.1868 µs
cuda backward time: 448.009 ± 31.0569 µs
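
For what it's worth, the usual way to get this kind of CUDA speedup for per-sample normalization is a one-block-per-sample reduction in shared memory. A generic sketch of that pattern follows (illustrative only; not necessarily what this PR's kernel actually does):

// One CUDA block per sample; 128 threads cooperatively reduce the
// sample's sum and sum of squares, then thread 0 writes mean and invstd.
// Launch as: per_sample_stats<<<num_samples, 128>>>(x, means, invstds, k*nr*nc, eps);
__global__ void per_sample_stats(
    const float* x, float* means, float* invstds,
    long sample_size, float eps)
{
    const long n = blockIdx.x;
    const float* xn = x + n * sample_size;

    __shared__ float ssum[128];
    __shared__ float ssq[128];

    // each thread strides over the sample's elements
    float sum = 0, sq = 0;
    for (long i = threadIdx.x; i < sample_size; i += blockDim.x)
    {
        sum += xn[i];
        sq  += xn[i] * xn[i];
    }
    ssum[threadIdx.x] = sum;
    ssq[threadIdx.x]  = sq;
    __syncthreads();

    // shared-memory tree reduction (blockDim.x must be a power of two)
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1)
    {
        if (threadIdx.x < s)
        {
            ssum[threadIdx.x] += ssum[threadIdx.x + s];
            ssq[threadIdx.x]  += ssq[threadIdx.x + s];
        }
        __syncthreads();
    }

    if (threadIdx.x == 0)
    {
        // single-pass variance: E[x^2] - E[x]^2, guarded by eps
        const float mean = ssum[0] / sample_size;
        const float var  = ssq[0] / sample_size - mean * mean;
        means[n]   = mean;
        invstds[n] = rsqrtf(var + eps);
    }
}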

(Review comment on dlib/test/dnn.cpp, resolved)
@davisking (Owner) commented:

Cool, looks good. Thanks for another PR :)

Rename that function you noted, update the docs, and add that one test, and this is good to go :)

@davisking (Owner) commented:

Nice, looks good :)

@arrufat mentioned this pull request Jan 16, 2022