This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

[Feature Request] support of diag for N-d arrays #12327

Closed
jasonyu1996 opened this issue Aug 24, 2018 · 8 comments

Comments

@jasonyu1996
Contributor

Hi!

I just found that the diag operator does not support N-d arrays where N > 2. In my experience, it would be more useful if the N > 2 cases were properly designed. For example, I find it troublesome to take the diagonals of several matrices of the same shape at the same time. This task can be accomplished with a combination of arange, tile and pick, but that is complicated, confusing and error-prone. To support this, the behaviour for N > 2 could be defined as taking the diagonal over the last two axes, i.e., when fed an array of shape [d1, d2, ..., dn-2, dn-1, dn], where the diagonal of a [dn-1, dn] matrix has length k, diag would return an array of shape [d1, d2, ..., dn-2, k]. Of course, this could be made more flexible (for example, by allowing the axes to reduce to be specified).
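The proposed semantics already exist in NumPy's diagonal, which can serve as a quick sketch of the expected shapes (the array values here are purely illustrative):

```python
import numpy as np

# A batch of 2 matrices, each of shape (3, 4): overall shape
# [d1, dn-1, dn] = [2, 3, 4]; the diagonal of a 3x4 matrix
# has length k = 3, so the result should have shape [2, 3].
batch = np.arange(24, dtype=np.float32).reshape(2, 3, 4)

# Take the diagonal over the last two axes of every matrix at once.
diags = np.diagonal(batch, axis1=-2, axis2=-1)

print(diags.shape)  # (2, 3)
print(diags)        # [[ 0.  5. 10.]
                    #  [12. 17. 22.]]
```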

PyTorch provides a diag operator that behaves in this way. TensorFlow actually splits it into two operators, diag and diag_part: the former constructs diagonal matrices and the latter extracts diagonals from matrices. They are designed to support N > 2, but not in a way I find useful or flexible.

On the MXNet forum: https://discuss.mxnet.io/t/diag-for-n-d-arrays/1707

@ankkhedia
Contributor

@mxnet-label-bot [Feature Request, NDArray]

@samskalicky
Contributor

samskalicky commented Aug 29, 2018

Hi @jasonyu1996. Thanks for reporting this. I've been looking to implement a trace operator in mxnet per this request #10500. In preparing to implement trace (which is really just summing the diagonal of the matrix) I also noticed the limited implementation of the diag operator.

Given that many MXNet users have data in the form of WxHxC (width, height, channel), and then add a 4th dimension for the batch, what are your thoughts on a general N-dimensionality approach for this operator? Is it necessary to support the general case?

And what about the implementation of the diag operator? As you mentioned, there are already existing, high-performance implementations of each required sub-computation. Do you think there is an opportunity for further performance improvement (memory, time, etc.) by fusing these together? Or would it be best to implement diag by calling these sub-computations separately (inside the diag operator)?

Let me know if you have thoughts on this. I would be interested in working with you to implement this as well.

Here's the original diag issue: #9253

@jasonyu1996
Contributor Author

jasonyu1996 commented Aug 29, 2018

Hi! Thank you for your response! I just paid a visit to the numpy interfaces for computing the diagonal, and noticed that besides numpy.diag (which is exactly where the design of our diag operator comes from), numpy provides a second diagonal-extracting function, numpy.diagonal (https://www.numpy.org/devdocs/reference/generated/numpy.diagonal.html), which in my opinion is a good reference for extending the functionality of our diag operator (I also noticed that numpy.trace supports N-d arrays and accepts the same set of arguments as numpy.diagonal). However, I am not sure whether a new operator should be added or not. I wonder why numpy provides two functions, one strictly weaker than the other, that do the same thing.
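For reference, a small sketch of the two numpy interfaces mentioned above, and of how numpy.trace accepts the same axis1/axis2 arguments as numpy.diagonal:

```python
import numpy as np

# numpy.diag only handles 1-d and 2-d inputs: it builds a diagonal
# matrix from a vector, or extracts the diagonal of a matrix.
v = np.array([1, 2, 3])
m = np.diag(v)                      # 3x3 matrix with v on the diagonal
print(np.diag(m))                   # [1 2 3]

# numpy.diagonal generalizes extraction to N-d via axis1/axis2,
# and numpy.trace takes the same arguments.
a = np.arange(8).reshape(2, 2, 2)
d = a.diagonal(axis1=1, axis2=2)    # shape (2, 2): one diagonal per matrix
t = np.trace(a, axis1=1, axis2=2)   # sums those same diagonals
print(d)                            # [[0 3]
                                    #  [4 7]]
print(t)                            # [ 3 11]
```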

As for the implementation details, I have to admit that I am not familiar with this area and am therefore not sure whether the performance could be improved further by implementing it in ways other than simply fusing some high-level function calls together. It may be worth referring to the implementation of the 2-d case, which does not seem to depend on other high-level function calls (diag for 2-d arrays can also be implemented with an arange followed by a pick).
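The arange-followed-by-pick formulation of the 2-d case can be sketched in NumPy, with take_along_axis standing in for MXNet's pick (the mapping is illustrative, not the actual MXNet kernel):

```python
import numpy as np

# diag(x) for a 2-d x: row i picks column i, i.e. an arange of row
# indices used as per-row column indices.
x = np.arange(9).reshape(3, 3)
idx = np.arange(3).reshape(3, 1)              # row i -> column i
diag = np.take_along_axis(x, idx, axis=1).squeeze(1)
print(diag)                                   # [0 4 8]
```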

@samskalicky
Contributor

@jasonyu1996 I have been prototyping a general diagonal operator based on the numpy implementation. The issue I see is that numpy uses views, so the diagonal operator simply returns a 'diagonal view' onto the original data layout in memory, whereas we want an actual copy of the data in the diagonal operator in mxnet. Seemingly we should be able to just traverse the view to copy the resulting diagonal values.
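The view-versus-copy distinction can be demonstrated in a few lines of NumPy (a sketch, not MXNet code):

```python
import numpy as np

a = np.arange(9).reshape(3, 3)
view = a.diagonal()                  # strided view: shares memory with a
assert np.shares_memory(view, a)     # no data copied yet

copied = view.copy()                 # materialize the diagonal
a[0, 0] = 99
print(view[0])                       # 99 -- the view tracks the original
print(copied[0])                     # 0  -- the copy does not
```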

I'm not a deep learning expert, so I don't understand the need for or usage of a diagonal operator in a network. Can you provide some motivation or use-cases for where a diagonal operator might be needed in a model?

@jasonyu1996
Contributor Author

@samskalicky Yes. Actually I have already implemented one version in this way (#12430), simply traversing and copying the data. However, I think it would be more efficient to just generate a new view, as is the case in numpy. The problem here is that MXNet seems to lack good support for this, as I have reported in #12449.

As for the possible usage of a diagonal operator, as far as I know, it is sometimes necessary to compute a regularizer based on the diagonal of a matrix (or the diagonals of a batch of matrices). trace might be sufficient at times, but in other cases it is more flexible to operate on the diagonal itself (e.g., when taking the norm of the diagonal).
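A minimal sketch of that regularizer use-case, using NumPy in place of MXNet and a hypothetical diag_norm_regularizer helper (all names here are illustrative):

```python
import numpy as np

def diag_norm_regularizer(batch, weight=0.01):
    # batch: (N, d, d) -- take each matrix's diagonal, then sum the
    # per-matrix L2 norms of those diagonals.
    diags = np.diagonal(batch, axis1=-2, axis2=-1)   # shape (N, d)
    return weight * np.linalg.norm(diags, axis=-1).sum()

batch = np.stack([np.eye(3), 2 * np.eye(3)])
reg = diag_norm_regularizer(batch, weight=1.0)
print(reg)   # sqrt(3) + 2*sqrt(3) = 3*sqrt(3)
```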

@samskalicky
Contributor

Thanks @jasonyu1996! I'll review your PR tomorrow and see if we can't get it merged.

Thanks for the explanation of regularization. Can you provide an example model using the diag/trace operator in the layer calculation to further motivate this contribution?

@jasonyu1996
Contributor Author

@samskalicky Thanks! I have just thought of two models (both for relation extraction/classification) that would be more convenient to implement with the help of the diag/trace operator:

  • Learning with Noise: Enhance Distantly Supervised Relation Extraction with Dynamic Transition Matrix (https://arxiv.org/abs/1705.03995). They use a matrix to model the transition from the true label to the noisy one, whose trace appears in the loss as a regularizer.
  • Neural Relation Extraction with Selective Attention over Instances (http://aclweb.org/anthology/P16-1200). The output of this model is the diagonal of a matrix whose rows are first softmaxed. When fed with a batch of inputs, the outputs could be fetched by taking the diagonals of several matrices at the same time.
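The second use-case above can be sketched in NumPy (illustrative names and shapes, not the paper's actual implementation): softmax each row of every score matrix in a batch, then take the per-matrix diagonals in one shot.

```python
import numpy as np

def row_softmax(m):
    # Numerically stable softmax over the last axis.
    e = np.exp(m - m.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

batch = np.arange(18, dtype=np.float64).reshape(2, 3, 3)  # 2 score matrices
out = np.diagonal(row_softmax(batch), axis1=-2, axis2=-1)
print(out.shape)   # (2, 3): one diagonal entry per (matrix, row) pair
```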

@sandeep-krishnamurthy
Contributor

Thanks @jasonyu1996 for your contribution. Resolving.
