-
Notifications
You must be signed in to change notification settings - Fork 6.8k
Utility to help developers debug operators: Tensor Inspector #15490
Conversation
Awesome! thank you for the nice utility. 👍 for clear and detailed PR description. |
I could probably rewrite build_checker() (https://github.com/apache/incubator-mxnet/pull/15490/files#diff-9e7d5c2420ecc900c4a85a3c35d91bffR446) with macro to suppress warnings when DType is not floating point. But it's not causing any issues because in such cases the control flow wouldn't go by the undefined operation. Will change if have time. |
After experimenting for several hours today, I think there is not a good way to breach at compile time to suppress the warnings. Macro apparently did not work. I also tried simulating C++17 style Having no warnings when compiling is nice, but given that this is a developer's tool and that behavior-wise the code has no issue, I think I will leave the warnings. |
@apeforest @anirudh2290 @access2rohit @larroy @sandeep-krishnamurthy , |
Can you split the PR into 2 separate PRs. One for CPU and one for GPU ? |
Just realized there were more than one Tensor structs. Added |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Created a separate PR for a new tutorial on usage. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! I see this being very useful for operator developers.
LGTM just address my last comment and no need to split this PR into 2 now since you have already addressed most comments |
retrigger the build ... there is some julia error. Probably not because of your code changes. Also add link to the test cases that you covered |
cases covered are described in my tutorial at the end. Link: #15517 |
@sandeep-krishnamurthy Looks ready! |
src/common/tensor_inspector.h
Outdated
if (pos->size() > static_cast<size_t>(dimension)) { | ||
return false; | ||
} | ||
for (unsigned i = 0; i < pos->size(); ++i) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
replace unsigned
with size_t
src/common/tensor_inspector.h
Outdated
const int sub_dim = dimension - pos.size(); | ||
sub_shape->resize(sub_dim); | ||
index_t multiple = 1; | ||
for (int i = pos.size(), j = 0; i < dimension; ++i, ++j) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
change int to index_t or size_t
src/common/tensor_inspector.h
Outdated
} | ||
index_t sum = 0; | ||
index_t m = 1; | ||
for (int i = pos.size() - 1; i >= 0; --i) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
change int to index_t
Three more changes. Otherwise good to go. Thanks for your contribution. |
After a few more changes requested by Lin |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
struct InspectorManager { | ||
static InspectorManager* get() { | ||
static std::mutex mtx; | ||
static std::unique_ptr<InspectorManager> im = nullptr; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add an utility function encapsulating this idiom? this is being used in several places, and would be good to use this.
Created a separate PR for a new tutorial on usage.
#15517
Background
When developing new operators, developers need to deal with Tensors, TBlobs, and NDArrays extensively (for simplicity, we will simply refer them as “tensors”). However, there is no existing tool that helps developers inspect the value of the tensors within an operator, so it’s hard to locate the spot where the computation goes wrong. With that said, we can create a new utility that supports inspecting, dumping and verifying tensor values to alleviate the developer’s need to use hacky prints every time. This new utility will support all the three data types (Tensor, TBlob, NDArray), as well as both CPU and GPU tensors.
Some related GitHub issues: #15198
Methodology
First off, to support all three data types, we will need to first convert them to TBlob, so that we can base our new functionalists on one single class and avoid code repetition. To access the value of a tensor/TBlob object, the easiest way is to add some function, such as to_string(), to the TBlob class. Then, we could use shape_ to dereference dptr and extract the actual data from the tensor in a structured format. However, to preserve the integrity of the TBlob class, we would want to avoid making direct changes to tensor_blob.h. Instead, considering that all the member variables of TBlob are all public, we could create a wrapper around TensorBlob, which we would call TensorInspector, and we could then add our new functions to this new wrapper class. We could put this new file tensor_inspector.h in src/common along with a bunch of other utilities.
Besides tensor inspectors which are one-to-one with tensors, I also propose to have a singleton class called InspectorManage which is responsible for data sharing and maintaining locking between the individual TensorInspector objects. This way, we could control some global behaviors within each of the objects.
The Value of Creating a New Utility
It’s important to point out that there is an existing utility that helps print out the value of a tensor, tests/cpp/include/test_util.h, (https://github.com/apache/incubator-mxnet/blob/master/tests/cpp/include/test_util.h#L472). However, this new utility under discussion aims to start from the use cases of the users, who in our case are developers, and has more functionalities than the existing utility. Note: this new utility will use CAccessAsCPU from test_util.h for data copying from GPU memory to CPU memory.
Proposed Funtionalities/APIs
Create a TensorInspector Object from Tensor, TBlob, and NDarray Objects
We can create a TensorInspector object by passing in a Tensor, TBlob, or NDArray object.
Print Tensor Value (static)
We could use
std::string TensorInspector::to_string()
orvoid TensorInspector::print_string()
to get a formatted string representation of the entire tensor. We could see the tensor value in a structured way, the data type, as well as the tensor shape.Interactively Print Tensor Value (dynamic)
During debugging, situations might occur that ahead of time (at compilation time), we do not know which part of a tensor we need to inspect. Also, sometimes we might want to pause the operator control flow to “zoom into” an erroneous part of a tensor multiple times until we are satisfied to move on. In this regard, we could use void
TensorInspector::interactive_print(std::string tag = "")
like a break point to dynamically specify the part of the tensor to inspect and whether we want to skip the break point. tag is an optional parameter where we can give the break points names so that we can distinguish them. A “Visit” counter would tell us how many times have we stepped into a break point with this tag name.As we can see from the screenshot, we would be in a loop that asks us for command. We would be prompted with the Tensor info, <f Tensor 20x1x5x5>, and we could enter coordinates such as “0,0” to print only a part of the tensor. We could also type “e” for the entire tensor, “b” to break from this loop, and “s” to break from this loop and skip all the future break points.
Check Tensor Value
Sometimes, developers might want to check if the tensor contains unexpected values which could be negative values, Nans, or others. We could have a templated api
std::vector<std::vector<int>>TensorInspector::validate_value(const ValueChecker& vc, bool interactive = false, std::string tag = "")
where vc is a bool value checker function that users can define, interactive is whether you want to check the values on the go (similar to interactive print, refer to the first screenshot below), and tag is the name you want to give to the call to identify it. The return is a vector of coordinates (represented with vector) at which checker evaluates to true.To the developer's convenience, we would also pre-define a bunch of value checkers. Refer to the Enum below:
Developers could skip wiring their own checker by using this API:
std::vector<std::vector<int>>TensorInspector::validate_value(ValueChecker vc, bool interactive = false, std::string tag = "")
This API can also be used creatively to find coordinates where the values equal to some required value or fall in some range.
Dump Tensor Value
Sometimes, we would want to dump the tensor to a file in binary mode, so that we could use, for example, a python script to further analyze the numbers. We could use void TensorInspector::dump_value(std::string tag) to achieve this. We would want to use .npy format for its compatibility with Numpy and its great load efficiency. A benchmark can be found here: https://github.com/mverleg/array_storage_benchmark.
In this screenshot, we can compare side by side the regular print (left) and the value dump (right). Even without formatting the dumped value, we can see it has a better accuracy.
As we can see in the screenshot, the output files are named "{tag}_{visit}.npy", where tag is the name that we give to the call, and visit is the visit count.
To load the file, we could just do:
Notice: in interactive print, we could also do value dump with command "d":
Usage
To test or use this new utility in your operator, just include the .h file and construct a TensorInspector object, say "ti", as described above. Then, you can use all functionalities listed above by writing, for example, "ti.print_string()" or "ti.interactive_print()". Finally, just run any python script that uses the operator that you modified.
TODOs
Checklist
Essentials
Please feel free to remove inapplicable items for your PR.
Changes
Comments