
Add Layer Normalization #2213

Merged: 13 commits, Oct 20, 2020

Conversation

@arrufat (Contributor) commented Oct 12, 2020

This is a second step towards implementing transformers in dlib.

This PR adds Layer Normalization support to dlib.

I have added the CPU implementation and the GPU forward pass, but I have some doubts, so I would really appreciate it if somebody could have a look at the CUDA implementation.

It runs slower than the CPU version, and for big inputs (2, 3, 224, 224) the results are wrong, but for small inputs it works.
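
For context, layer normalization standardizes all the values in each sample to zero mean and unit variance, then applies a per-sample affine transform. Here is a minimal reference sketch of that computation (a hypothetical standalone helper operating on flat buffers, not dlib's actual cpu::layer_normalize):

#include <cmath>
#include <cstddef>

// Normalize each sample of a flattened [num_samples x sample_size] buffer
// to zero mean / unit variance, then scale and shift with per-sample
// gamma and beta (sized per sample, matching the test program below).
void layer_norm_reference(
    float* y, const float* x,
    const float* gamma, const float* beta,
    std::size_t num_samples, std::size_t sample_size, float eps)
{
    for (std::size_t n = 0; n < num_samples; ++n)
    {
        const float* xn = x + n * sample_size;
        float* yn = y + n * sample_size;

        float mean = 0;
        for (std::size_t i = 0; i < sample_size; ++i)
            mean += xn[i];
        mean /= sample_size;

        float var = 0;
        for (std::size_t i = 0; i < sample_size; ++i)
            var += (xn[i] - mean) * (xn[i] - mean);
        var /= sample_size;

        const float invstd = 1.0f / std::sqrt(var + eps);
        for (std::size_t i = 0; i < sample_size; ++i)
            yn[i] = gamma[n] * (xn[i] - mean) * invstd + beta[n];
    }
}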

Here's how I tried it, in case anyone wants to reproduce it:

#include <dlib/dnn.h>
#include <dlib/statistics.h> // running_stats

#include <chrono>
#include <iostream>

using namespace dlib;

int main() try
{

    // print the average over all elements of a tensor
    auto print_tensor_and_avg = [](const tensor& t)
    {
        float sum = 0;
        const float* p_t = t.host();
        for (long n = 0; n < t.num_samples(); ++n)
        {
            // std::cout << "sample: " << n << '\n';
            for (long k = 0; k < t.k(); ++k)
            {
                // std::cout << "\tk: " << k << '\n';
                for (long r = 0; r < t.nr(); ++r)
                {
                    for (long c = 0; c < t.nc(); ++c)
                    {
                        // std::cout << '\t' << p_t[tensor_index(t, n, k, r, c)] << ' ';
                        sum += p_t[tensor_index(t, n, k, r, c)];
                    }
                    // std::cout << '\n';
                }
            }
        }
        std::cout << "avg: " << sum / t.size() << '\n';
    };

    // a batch of 2 samples, each with 3x224x224 values
    resizable_tensor x(2, 3, 224, 224);
    resizable_tensor y_cpu(x), y_cuda(x);
    tt::tensor_rand rnd(0);
    rnd.fill_uniform(x);
    // per-sample statistics and per-sample affine parameters
    resizable_tensor means(x.num_samples()), invstds(x.num_samples());
    resizable_tensor gamma(x.num_samples()), beta(x.num_samples());
    gamma = 1; // identity scale
    beta = 0;  // no shift

    print_tensor_and_avg(x);

    const float eps = 1e-5f;
    running_stats<float> rs;
    cpu::layer_normalize(eps, y_cpu, means, invstds, x, gamma, beta);
    print_tensor_and_avg(y_cpu);
    // time 1000 CPU runs
    for (int i = 0; i < 1000; ++i)
    {
        const auto t0 = std::chrono::steady_clock::now();
        cpu::layer_normalize(eps, y_cpu, means, invstds, x, gamma, beta);
        const auto t1 = std::chrono::steady_clock::now();
        rs.add(std::chrono::duration_cast<std::chrono::duration<float, std::micro>>(t1 - t0).count());
    }
    std::cout << "inference time: " << rs.mean() << " ± " << rs.stddev() << " µs\n";

    rs.clear();
    cuda::layer_normalize(eps, y_cuda, means, invstds, x, gamma, beta);
    print_tensor_and_avg(y_cuda);
    // time 1000 CUDA runs
    for (int i = 0; i < 1000; ++i)
    {
        const auto t0 = std::chrono::steady_clock::now();
        cuda::layer_normalize(eps, y_cuda, means, invstds, x, gamma, beta);
        const auto t1 = std::chrono::steady_clock::now();
        rs.add(std::chrono::duration_cast<std::chrono::duration<float, std::micro>>(t1 - t0).count());
    }
    std::cout << "inference time: " << rs.mean() << " ± " << rs.stddev() << " µs\n";

    std::cout << "difference: " << sum(squared(mat(y_cpu) - mat(y_cuda))) << '\n';

    return EXIT_SUCCESS;
}
catch (const std::exception& e)
{
    std::cout << e.what() << '\n';
    return EXIT_FAILURE;
}

With a size of (2, 3, 4, 5), I get this:

// cpu
avg: 7.05322e-08
inference time: 29.3967 ± 0.475304 µs
// cuda
avg: 6.25849e-08
inference time: 49.7305 ± 1.50192 µs
difference: 2.39537e-12

(The near-zero averages are expected: with gamma = 1 and beta = 0, layer normalization gives every sample zero mean, so the average over the whole tensor should be ~0 up to floating-point error.)

But with a size of (2, 3, 224, 224):

// cpu
avg: -4.32392e-06
inference time: 777.195 ± 16.5298 µs
// cuda
avg: -379502
inference time: 4188.05 ± 267.693 µs
difference: nan

Thanks in advance for any help.

@arrufat marked this pull request as draft October 12, 2020 16:32
(Review comment on dlib/cuda/cuda_dlib.cu, resolved)
@arrufat marked this pull request as ready for review October 13, 2020 08:23
@arrufat (Contributor, Author) commented Oct 13, 2020

Ready for review!
With this implementation, these are the results for a (2, 3, 224, 224)-sized tensor:

 cpu forward time: 780.407 ± 12.5129 µs
cuda forward time: 87.3476 ± 2.33944 µs

 cpu backward time: 3706.97 ± 45.1868 µs
cuda backward time: 448.009 ± 31.0569 µs
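
For what it's worth, the usual way to get this kind of CUDA speedup for per-sample normalization is a one-block-per-sample reduction in shared memory. A generic sketch of that pattern follows (illustrative only; not necessarily what this PR's kernel actually does):

// One CUDA block per sample; 128 threads cooperatively reduce the
// sample's sum and sum of squares, then thread 0 writes mean and invstd.
// Launch as: per_sample_stats<<<num_samples, 128>>>(x, means, invstds, k*nr*nc, eps);
__global__ void per_sample_stats(
    const float* x, float* means, float* invstds,
    long sample_size, float eps)
{
    const long n = blockIdx.x;
    const float* xn = x + n * sample_size;

    __shared__ float ssum[128];
    __shared__ float ssq[128];

    // each thread strides over the sample's elements
    float sum = 0, sq = 0;
    for (long i = threadIdx.x; i < sample_size; i += blockDim.x)
    {
        sum += xn[i];
        sq  += xn[i] * xn[i];
    }
    ssum[threadIdx.x] = sum;
    ssq[threadIdx.x]  = sq;
    __syncthreads();

    // shared-memory tree reduction (blockDim.x must be a power of two)
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1)
    {
        if (threadIdx.x < s)
        {
            ssum[threadIdx.x] += ssum[threadIdx.x + s];
            ssq[threadIdx.x]  += ssq[threadIdx.x + s];
        }
        __syncthreads();
    }

    if (threadIdx.x == 0)
    {
        // single-pass variance: E[x^2] - E[x]^2, guarded by eps
        const float mean = ssum[0] / sample_size;
        const float var  = ssq[0] / sample_size - mean * mean;
        means[n]   = mean;
        invstds[n] = rsqrtf(var + eps);
    }
}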

(Review comment on dlib/test/dnn.cpp, resolved)
@davisking (Owner) commented:

Cool, looks good. Thanks for another PR :)

Rename that function you noted, update the docs, and add that one test, and this is good to go :)

@davisking (Owner) commented:

Nice, looks good :)

@arrufat mentioned this pull request Jan 16, 2022