Performance #1

Open
faroit opened this issue Apr 18, 2016 · 8 comments

faroit commented Apr 18, 2016

Compared to numpy.einsum, Einsum.jl seems to be quite slow. I wonder if there is some room for improvement here:

numpy

import numpy as np
import timeit

# P = np.random.random((20, 15, 100, 30, 2))

A = np.random.random((20, 15, 100, 5))
H = np.random.random((30, 5))
C = np.random.random((2, 5))


def run():
    return np.einsum('abfj,tj,cj->abftc', A, H, C)

times = timeit.Timer(run).timeit(number=10)

print("Elapsed time: {}s".format(times))

Elapsed time: 1.44563603401s

Julia

using Einsum

P = zeros(20,15,100,30,2); 

A = randn(20,15,100,5);
H = randn(30,5);
C = randn(2,5);

tic()
for i = 1:10
  @einsum P[a,b,f,t,c] = A[a,b,f,j]*H[t,j]*C[c,j]
end
toc()

elapsed time: 85.405141333 seconds

Jutho commented Apr 18, 2016

Timing correctly (i.e. putting the code in a function and not counting the first run, which includes compilation/warm-up; see the Julia manual) brings this down to approximately 2 s per run on my machine, i.e. about 20 s for the 10 runs. That is still slower than Python. (Is Python reporting the time per run or the total time for the 10 runs?)
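
A minimal sketch of such a timing harness, assuming BenchmarkTools.jl is available (array sizes as in the original post):

using Einsum, BenchmarkTools

A = randn(20,15,100,5); H = randn(30,5); C = randn(2,5);
P = zeros(20,15,100,30,2);

# Wrapping the contraction in a function lets it be compiled and specialized once.
contract!(P, A, H, C) = @einsum P[a,b,f,t,c] = A[a,b,f,j]*H[t,j]*C[c,j]

contract!(P, A, H, C)              # first call pays the compilation cost
@btime contract!($P, $A, $H, $C)   # reports the steady-state time per call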

Inspecting the resulting function with @code_warntype shows that the accumulation variable s generated by the macro is type-unstable, i.e. not inferred correctly. The reason is that it is initialized as 0 (an Int64), while the contributions that are then added to it are of type Float64 in this example. Quickly changing the code to s = zero($lhs) shows this running in about 0.3 s for the 10 runs, i.e. about 0.03 s per run.

However, a more sensible initialization of s is required, which depends on the type on the left hand side (or on the type of the evaluated right hand side).
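
To make the issue concrete, here is a hand-written stand-in for the generated inner kernel (an illustration only, not the macro's actual output):

# Type-unstable: the accumulator starts as an Int64 and is promoted to Float64
# inside the loop, so inference can only give it a Union type.
function kernel_unstable(A, H, C, a, b, f, t, c)
    s = 0
    for j in 1:size(A, 4)
        s += A[a,b,f,j] * H[t,j] * C[c,j]
    end
    return s            # inferred as Union{Int64, Float64}
end

# Type-stable: initialize the accumulator from an element type instead
# (in the macro this would have to come from the LHS or RHS element types).
function kernel_stable(A, H, C, a, b, f, t, c)
    s = zero(eltype(A))
    for j in 1:size(A, 4)
        s += A[a,b,f,j] * H[t,j] * C[c,j]
    end
    return s
end

A = randn(20,15,100,5); H = randn(30,5); C = randn(2,5);
@code_warntype kernel_unstable(A, H, C, 1, 1, 1, 1, 1)   # highlights the Union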

ahwillia (Owner) commented

Yes, sorry, this was on my to-do list. I just pushed some commits (0968d13) and it is much faster for me now.

https://github.com/ahwillia/Einsum.jl#benchmarks

If this works for you both then I'll close this issue.


faroit commented Apr 19, 2016

I tried your new implementation. While there is a significant speedup compared to the older version, performance still depends strongly on the actual summation.

I therefore created some small benchmark scripts to compare the einsum methods. I've also included opt_einsum, which is an optimized einsum package for numpy.

benchmark

  • The results show the measured average time in seconds for one run.
  • n is a parameter that influences the tensor dimensions but differs between sums.
  • For the simple 2D parafac the differences are very small.
  • For the 5D parafac (like the one you have in https://github.com/ahwillia/Einsum.jl#benchmarks) Einsum.jl is faster than numpy but not as fast as opt_einsum.
  • On the commonfate index sum Einsum.jl is actually the slowest, so there is probably more room for improvement.

@Jutho feel free to comment on the measured timings or send me a PR in the benchmark repo


ahwillia commented Apr 19, 2016

Wow, nice work! I didn't know about the opt_einsum package; its documentation is quite good. I will give it a careful read when I get the time. Would you mind opening a PR here to include your more rigorous benchmarks?

Any ideas on why we lose out on the commonfate benchmark? Switching the order of the indices H[t,j] -> H[j,t] and C[c,j] -> C[j,c] should improve our speed (fewer cache misses), though I'd assume numpy faces the same problem.
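
A minimal sketch of that index swap, with permuted copies introduced here just for illustration, so the contracted index j becomes the first (fastest-varying) dimension of the small factors:

using Einsum

A = randn(20,15,100,5); H = randn(30,5); C = randn(2,5);

Ht = permutedims(H)   # (5, 30): j comes first, so the inner j loop walks down a column
Ct = permutedims(C)   # (5, 2)

@einsum P[a,b,f,t,c] := A[a,b,f,j] * Ht[j,t] * Ct[j,c]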

Looks like beating opt_einsum will take a decent (but feasible!) amount of extra code.

Edit: two relevant Stack Overflow questions: Q1, Q2


ahwillia commented Jul 7, 2016

A brief update: adding @inbounds and @simd leads to a roughly 2x speed-up on the CPD benchmark. The common fate benchmark is still slower than I'd like. My guess is that we have to be clever about the order of the for loops.

Even better would be to follow opt_einsum and figure out intermediate solutions. This is outside of my bandwidth at the moment, but all input and PRs are welcome. I'll tag a new release somewhat soon after I do more testing to make sure this doesn't break anything.
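
To illustrate the intermediate-solution idea (a sketch under my own assumptions, not something the macro does today): for the common fate contraction the two small factors can be combined first, so the innermost loop of the big contraction does a single multiply instead of two:

using Einsum

A = randn(20,15,100,5); H = randn(30,5); C = randn(2,5);

# Cheap intermediate over the shared index j (only 30*2*5 elements).
@einsum HC[t,c,j] := H[t,j] * C[c,j]

# One remaining contraction against the large factor A.
@einsum P[a,b,f,t,c] := A[a,b,f,j] * HC[t,c,j]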

mihirparadkar commented

Is it possible for the package to use TensorOperations' @tensor when possible? It seems like the difference between @tensor and @einsum is that @einsum allows index pairs that appear on the RHS to also appear on the LHS. Since @tensor can be orders of magnitude faster than the loop approach for contractions (because of BLAS), is there a way to replace the inner loops with @tensor, keeping the broadcasting in the outer loops? I imagine that would greatly improve performance. For example, if

C = zeros(50,100)
B = randn(50,100)
A = randn(50,50,100)

Then the generated code could look like:

using TensorOperations

# equivalent to: @einsum C[i,k] = A[i,j,k] * B[j,k]
for k in 1:size(A,3)
  vC = view(C,:,k)
  vA = view(A,:,:,k)
  vB = view(B,:,k)
  @tensor vC[i] = vA[i,j] * vB[j]
end

ahwillia (Owner) commented

This would be a great addition to this package! Feel free to open a PR. At the moment, I am working on other projects and do not have time to implement this. Similarly, you could think about swapping out operations for BLAS calls directly (e.g. identifying sub-problems that are matrix multiplies).
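
As an illustration of the BLAS idea for the common fate contraction (a sketch; the intermediate HC and the reshapes are my own, not part of the package):

using Einsum

A = randn(20,15,100,5); H = randn(30,5); C = randn(2,5);

# Combine the small factors with the shared index j placed first.
@einsum HC[j,t,c] := H[t,j] * C[c,j]                    # size (5, 30, 2)

# The remaining contraction is a single matrix multiply: (a*b*f, j) x (j, t*c).
M = reshape(A, :, size(A,4)) * reshape(HC, size(HC,1), :)
P = reshape(M, size(A,1), size(A,2), size(A,3), size(H,1), size(C,1))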


GiggleLiu commented Sep 13, 2018

Unreasonably high allocations compared with @Jutho's TensorOperations.jl (even with pre-allocation):

a = randn(200,2,200)
b = randn(200,2,190)
c = randn(200,2,2,190)

############# Einsum.jl #################
julia> @benchmark @einsum c[i,j,l,m] = a[i,j,k]*b[k,l,m]
BenchmarkTools.Trial: 
  memory estimate:  2.74 GiB
  allocs estimate:  152762661
  --------------
  minimum time:     5.796 s (6.30% GC)
  median time:      5.796 s (6.30% GC)
  mean time:        5.796 s (6.30% GC)
  maximum time:     5.796 s (6.30% GC)
  --------------
  samples:          1
  evals/sample:     1
-----------------  an update  ---------------------
julia> @benchmark @einsum $c[i,j,l,m] = $a[i,j,k]*$b[k,l,m]
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     45.007 ms (0.00% GC)
  median time:      45.173 ms (0.00% GC)
  mean time:        45.280 ms (0.00% GC)
  maximum time:     47.874 ms (0.00% GC)
  --------------
  samples:          111
  evals/sample:     1

############# TensorOperations.jl ##############
julia> @benchmark tensorcontract(a, (1,2,3), b, (3,4,5), (1,2,4,5))
BenchmarkTools.Trial: 
  memory estimate:  1.16 MiB
  allocs estimate:  30
  --------------
  minimum time:     1.244 ms (0.00% GC)
  median time:      1.287 ms (0.00% GC)
  mean time:        1.453 ms (3.73% GC)
  maximum time:     60.382 ms (97.04% GC)
  --------------
  samples:          3415
  evals/sample:     1

Update note

According to the benchmark results, there is a type instability related to the array arguments (non-constant globals); with that resolved, e.g. by interpolating them with $ as above, the performance of Einsum.jl is much better, but still much slower than TensorOperations.jl.
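
A sketch of the same point: wrapping the contraction in a function, so the arrays are passed as typed arguments rather than read as non-constant globals, should avoid the bulk of the allocations without needing $ interpolation:

using Einsum

contract!(c, a, b) = @einsum c[i,j,l,m] = a[i,j,k] * b[k,l,m]

a = randn(200,2,200); b = randn(200,2,190); c = zeros(200,2,2,190);
contract!(c, a, b)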
