This repository has been archived by the owner on Dec 6, 2023. It is now read-only.

[WIP] Adagrad solver for arbitrary orders. #8

Open
wants to merge 12 commits into base: master

Conversation

@vene (Collaborator) commented Sep 12, 2016

Implements the stochastic gradient (SG) algorithm from "Higher-order Factorization Machines",
Mathieu Blondel, Akinori Fujino, Naonori Ueda, Masakazu Ishihata.
In Proceedings of Neural Information Processing Systems (NIPS), December 2016.

Todos

  • Explicit fitting of the lower matrices, as in CD
  • Benchmarks
  • Consider avoiding the numpy import in cython
  • Unify callback API with CD before merging
  • Test coverage.
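
For reference, the per-parameter update AdaGrad performs is the textbook one below. This is a minimal NumPy sketch of the general scheme, not the PR's Cython solver (which applies the update to the factor matrices P directly); the function name and defaults are illustrative.

```python
# Generic AdaGrad step (textbook form), for reference only; not the PR's code.
import numpy as np

def adagrad_step(param, grad, accum, learning_rate=0.1, eps=1e-6):
    """Update `param` given its gradient and the running sum of
    squared gradients `accum` (same shape as `param`)."""
    accum += grad ** 2                                     # G_t = G_{t-1} + g_t^2
    param -= learning_rate * grad / (np.sqrt(accum) + eps)
    return param, accum
```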

Returns
-------
Gram matrix : array of shape (n_samples_1, n_samples_2)
"""
if degree == 2:

if degree > 3 or method == 'dp':

What about degree == 3?

@vene (Collaborator, Author):

The cases `degree in (2, 3)` are dealt with using the old closed-form approach; my benchmarks show this is faster in batch settings like this one.
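
For context, here is a small NumPy sketch of the two routes being compared: the degree-2 closed form versus the dynamic-programming recursion from the HOFM paper, which handles arbitrary degrees. Illustrative only; it does not mirror the Cython code in this diff.

```python
import numpy as np

def anova_closed_form_deg2(p, x):
    # A^2(p, x) = 0.5 * ((p . x)^2 - sum_j p_j^2 x_j^2)
    return 0.5 * (np.dot(p, x) ** 2 - np.sum((p * x) ** 2))

def anova_dp(p, x, degree):
    # DP recursion: a[t, j] = a[t, j-1] + p_j * x_j * a[t-1, j-1], with a[0, :] = 1
    d = len(x)
    a = np.zeros((degree + 1, d + 1))
    a[0, :] = 1.0
    for t in range(1, degree + 1):
        for j in range(t, d + 1):
            a[t, j] = a[t, j - 1] + p[j - 1] * x[j - 1] * a[t - 1, j - 1]
    return a[degree, d]

rng = np.random.RandomState(0)
p, x = rng.randn(6), rng.randn(6)
assert np.isclose(anova_closed_form_deg2(p, x), anova_dp(p, x, degree=2))
```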


cdef Py_ssize_t t, j, jj

for jj in range(nnz + 1):

Cython's NumPy support should allow low-level vectorized `=` (assignment) operations, right?

@vene (Collaborator, Author):

You mean `A[:, 0] = 1`? I think that works only for NumPy arrays; I'm not sure it's implemented for generic memoryviews. I'll give it a try.
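
For illustration, assuming the loop in the diff is filling the base case of the DP table (my reading, not stated explicitly here), the two forms under discussion look like this at the NumPy level:

```python
import numpy as np

nnz, degree = 5, 3
A = np.zeros((nnz + 1, degree + 1), order='F')   # column-major, like double[::1, :]

# explicit loop, as in the Cython code under review
for jj in range(nnz + 1):
    A[jj, 0] = 1.0

# vectorized slice assignment; this works on ndarrays, and whether typed
# memoryviews accept it is exactly the open question in this thread
A[:, 0] = 1.0
```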

@@ -0,0 +1,67 @@
cdef inline double _fast_anova_kernel(double[::1, :] A,
@vene (Collaborator, Author):

This file needs the cython directives for efficiency, too. I added them but didn't get to commit yet, because I'm in the middle of some profiling: the adagrad solver is at the moment slower than I expected.

@EthanRosenthal

Is there anything I can do to help complete this PR?

@vene (Collaborator, Author) commented Nov 11, 2016

Hi @EthanRosenthal

I'm a bit busy these days, but I plan to focus on this project and get a release out by the end of the year.

The current problem with this PR is that the adagrad solver seems to take considerably more time per epoch than CD. If you manage to pinpoint the issue, that would be great.

~~(Let me try to push the work-in-progress benchmark code that I have around.)~~ Edit: it was already up.

@EthanRosenthal

@vene Sounds good - I'm happy to take a look and see if I can improve the performance.

@vene (Collaborator, Author) commented Dec 21, 2016

Making P and P-like arrays (especially the gradient output) Fortran-ordered makes this ~1.5x faster, but coordinate descent solvers still seem faster per iteration. It's also weird that degree 3 is faster than degree 2; I set a very low tolerance to prevent it from converging early, but I should check why.

I just realized it might be because the alpha and beta regularization parameters have different meanings for the two solvers unless this is accounted for.

(with P in C order all the time)
Classifier            train       test         f1   accuracy
------------------------------------------------------------
fm-3-ada          129.4632s    0.3683s     0.0000     0.9576
fm-2-ada          199.0683s    0.1267s     0.0000     0.9576
fm-2               15.9484s    0.1455s     0.1437     0.7594
polynet-3          15.5964s    0.0836s     0.1624     0.8425
fm-3               21.2889s    0.5154s     0.3695     0.9429
polynet-2          12.9678s    0.0915s     0.4800     0.9620

(with P (and everything else) in F order all the time)
Classifier            train       test         f1   accuracy
------------------------------------------------------------
fm-2-ada          138.9727s    0.1163s     0.0000     0.9576
fm-3-ada           72.2692s    0.3488s     0.0000     0.9576
fm-2               16.2693s    0.1453s     0.1437     0.7594
polynet-3          15.6552s    0.0843s     0.1624     0.8425
fm-3               21.4048s    0.5239s     0.3695     0.9429
polynet-2          13.0174s    0.0924s     0.4800     0.9620
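
For readers reproducing this, the F-ordering above is just NumPy's column-major layout; a minimal sketch of how such arrays would be allocated (illustrative, not the PR's actual allocation code):

```python
import numpy as np

n_components, n_features = 5, 100    # illustrative sizes

# allocate directly in Fortran (column-major) order ...
P = np.zeros((n_components, n_features), order='F')
grad_P = np.zeros_like(P, order='F')

# ... or convert an existing C-ordered array
P_f = np.asfortranarray(np.random.randn(n_components, n_features))
assert P_f.flags['F_CONTIGUOUS']
```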

@mblondel (Member)

Indeed, the regularization terms are not the same because of the 1/n factor in front of the loss when using stochastic gradient algorithms.
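
To spell out the point above (my paraphrase, not code from the PR): with the 1/n factor in front of the loss, the two objectives apply the same effective penalty only if alpha is rescaled by the number of samples.

```python
# CD-style objective:          sum_i loss_i       + alpha  * ||P||^2
# SG/AdaGrad-style objective: (1/n) sum_i loss_i  + alpha' * ||P||^2
#                           = (1/n) (sum_i loss_i + n * alpha' * ||P||^2)
# so the penalties match when alpha' = alpha / n_samples.
n_samples = 1079        # training set size from the 20 newsgroups benchmark below
alpha_cd = 1e-3         # hypothetical CD regularization strength
alpha_adagrad = alpha_cd / n_samples
```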

Regularization scaling is now ON by default. I think this is
sensible, because it keeps the choice independent of data split.

Adagrad seems very sensitive to the initial norm of P, so I changed
the init to have unit variance rather than 0.01.
This makes the benchmark more reasonable, but the norms are still weird.
Finicky tests (fm warm starts) had to be updated, but most
things behave well.
@vene (Collaborator, Author) commented Dec 21, 2016

Here's the performance after a bunch of tweaking and making the problem easier.

I'm printing out the norm of P to emphasize how inconsistent the solutions are, even after setting the regularization terms correctly. This is weird. When initializing P with standard deviation 0.01, adagrad sets it to zero very quickly (especially with lower learning rates).

20 newsgroups
=============
X_train.shape = (1079, 130107)
X_train.dtype = float64
X_train density = 0.0013896454072061155
y_train (1079,)
Training class ratio: 0.4448563484708063
X_test (717, 130107)
X_test.dtype = float64
y_test (717,)

Classifier Training
===================
Training fm-2 ... done
||P|| = 20542.6551563
Training fm-2-ada ... done
||P|| = 49.2109170758
Training fm-3 ... done
||P|| = 71852.8549548
Training fm-3-ada ... done
||P|| = 37.0119191191
Training polynet-2 ... done
Training polynet-3 ... done
Classification performance:
===========================

Classifier            train       test         f1   accuracy
------------------------------------------------------------
fm-3                6.5862s    0.0839s     0.4524     0.5105
fm-2                4.5888s    0.0178s     0.4645     0.5690
fm-3-ada           20.1919s    0.0684s     0.4993     0.4965
polynet-3           5.4179s    0.0099s     0.5114     0.5523
polynet-2           4.1679s    0.0099s     0.5449     0.5621
fm-2-ada           18.6872s    0.0126s     0.5698     0.5788

@vene (Collaborator, Author) commented Dec 21, 2016

The AppVeyor crash is not a test failure; for some reason I get `Command exited with code -1073740940`. No idea why right now...

@vene (Collaborator, Author) commented Dec 21, 2016

Now supports explicit fitting of lower orders (see the sketch below). There is no performance degradation, but the code is a bit unrolled and could be written more clearly.
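
A rough sketch of what explicit lower-order fitting means for prediction, as I read it (names and structure are illustrative, not the PR's API; it reuses the anova_dp helper sketched earlier in this thread): a degree-m model keeps a separate factor matrix per degree and sums the corresponding ANOVA kernels.

```python
import numpy as np

def predict_score(x, w, P_by_degree):
    # P_by_degree maps degree -> factor matrix of shape (n_components, n_features);
    # anova_dp is the DP ANOVA kernel from the earlier sketch.
    score = np.dot(w, x)
    for degree, P in P_by_degree.items():
        for p_s in P:                      # one ANOVA kernel per component row
            score += anova_dp(p_s, x, degree)
    return score
```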

I'm a bit confused by the way adagrad reacts to the learning rate, especially in the example, and by why P seems to shrink to 0 faster with lower learning rates. But the tests, at least, suggest things are sensible.

@vene (Collaborator, Author) commented Dec 22, 2016

On second thought, the Windows crash is not a fluke.

The exit code -1073740940 is 0xC0000374 in hex, which apparently means heap corruption.
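
(For anyone double-checking: reinterpret the signed 32-bit exit code as unsigned to get the NTSTATUS value.)

```python
# -1073740940 as an unsigned 32-bit value is 0xC0000374 (heap corruption).
assert hex(-1073740940 & 0xFFFFFFFF) == '0xc0000374'
```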

It seems that my last commit fixed it.
