Segmentation fault during CoreML NMS #752

Closed
dlawrences opened this issue Jul 6, 2020 · 16 comments
Labels
awaiting response: Please respond to this issue to provide further clarification (status); bug: Unexpected behaviour that should be corrected (type); PyTorch (traced)

Comments

@dlawrences

Hi team

I have created a model using the coremltools API. It takes the detections produced by a YOLOv5 CNN (https://github.com/ultralytics/yolov5) converted from PyTorch to CoreML and does the following operations:

  • filters out predictions with objectness < 0.1
  • concatenates all the remaining predictions from the different feature maps (in my case 20x20, 40x40 and 80x80)
  • slices xywh from the 8-dimensional YOLO prediction vector (i.e. x, y, w, h, objectness and 3 class scores)
  • slices the confidence for all three classes from the 8-dimensional YOLO prediction vector
  • multiplies the class confidence by the objectness
  • sends these two tensors as input to the CoreML NMS (a builder sketch of these steps follows this list)
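
A minimal NeuralNetworkBuilder sketch of the concat/slice/multiply steps above (the objectness filtering step is omitted; the per-map input names "preds_20"/"preds_40"/"preds_80", the layer names and the flattened (1, N, 8) layout are assumptions for illustration, not the author's actual code):

# assumes three flattened per-map prediction tensors of shape (1, N_i, 8)
builder.add_concat_nd(
    name="concat_preds",
    input_names=["preds_20", "preds_40", "preds_80"],
    output_name="all_preds",          # (1, N, 8)
    axis=1,
)
# x, y, w, h are the first four elements of each prediction vector
builder.add_slice_static(
    name="slice_xywh",
    input_name="all_preds",
    output_name="slice_xywh_output",  # (1, N, 4)
    begin_ids=[0, 0, 0], end_ids=[0, 0, 4], strides=[1, 1, 1],
    begin_masks=[True, True, True], end_masks=[True, True, False],
)
# objectness is element 4, the three class scores are elements 5..7
builder.add_slice_static(
    name="slice_obj",
    input_name="all_preds",
    output_name="slice_obj_output",   # (1, N, 1)
    begin_ids=[0, 0, 4], end_ids=[0, 0, 5], strides=[1, 1, 1],
    begin_masks=[True, True, False], end_masks=[True, True, False],
)
builder.add_slice_static(
    name="slice_conf",
    input_name="all_preds",
    output_name="slice_conf_output",  # (1, N, 3)
    begin_ids=[0, 0, 5], end_ids=[0, 0, 8], strides=[1, 1, 1],
    begin_masks=[True, True, False], end_masks=[True, True, False],
)
# class confidence * objectness, broadcast over the last axis
builder.add_multiply_broadcastable(
    name="multiply_obj_conf",
    input_names=["slice_conf_output", "slice_obj_output"],
    output_name="multiply_obj_conf_output",  # (1, N, 3)
)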

If I print the output of the two tensors feeding into NMS ("slice_xywh_output" and "multiply_obj_conf_output"), they look like the following (a sketch of how such intermediate outputs can be inspected follows the dumps):

Shape of key slice_xywh_output
(1, 32, 4)
Data of key slice_xywh_output
[[[ 0.22485352  0.10229492 -0.58935547 -0.62695312]
  [-2.84179688  0.07049561 -0.58007812 -0.61083984]
  [ 0.86621094  0.26000977  0.08728027  0.33251953]
  [-1.35449219  0.15258789  0.09832764  0.38842773]
  [ 0.28613281  0.1574707  -1.01953125 -0.796875  ]
  [-2.64648438  0.09161377 -1.03417969 -0.8203125 ]
  [ 0.84667969  0.2902832  -0.5234375   0.03759766]
  [-1.35351562  0.17749023 -0.49121094  0.08111572]
  [ 0.          0.          0.          0.        ]
  [ 0.33544922  0.17712402 -0.26904297 -0.42626953]
  [-2.36328125  0.33276367 -0.33203125 -0.51220703]
  [ 0.44677734 -2.09960938 -0.29321289 -0.52880859]
  [-0.30175781  1.50976562  0.22375488  0.23059082]
  [-0.52929688 -0.98681641  0.23474121  0.16210938]
  [ 0.54492188  0.37475586  1.33984375  3.04882812]
  [ 0.36499023  0.27441406 -0.69091797 -0.75634766]
  [-2.375       0.26928711 -0.67822266 -0.79638672]
  [ 0.54101562 -2.0546875  -0.70166016 -0.83154297]
  [-0.33911133  1.51953125 -0.30151367 -0.1673584 ]
  [ 1.81835938 -0.95898438 -0.18823242 -0.1619873 ]
  [-0.39160156 -0.99169922 -0.25439453 -0.24133301]
  [ 0.60986328  0.13586426  0.34594727  1.25488281]
  [-0.40820312  1.57617188 -0.92041016  0.27319336]
  [-0.37402344 -1.00976562 -0.90527344  0.24584961]
  [ 0.          0.          0.          0.        ]
  [ 0.3996582  -0.82910156  1.22265625  1.10742188]
  [ 0.50878906  1.27929688  0.00901031 -0.51855469]
  [ 0.17687988  0.84765625  1.88476562  2.375     ]
  [ 0.22387695 -0.90722656  1.93457031  1.92871094]
  [ 0.1920166   1.33789062 -0.06939697  1.67480469]
  [ 0.34277344 -0.66992188 -0.02412415  1.31542969]
  [ 0.          0.          0.          0.        ]]]
Shape of key multiply_obj_conf_output
(1, 32, 3)
Data of key multiply_obj_conf_output
[[[ -6.69376373   6.77657318 -12.07177734]
  [ -6.70858765   6.36968231  -8.99402618]
  [  7.53936768  -7.6975708   -9.77398682]
  [  6.46255493  -6.45073318  -7.97967911]
  [ -1.18617153   1.18326187  -2.25595665]
  [ -1.45161152   1.38298988  -1.9520216 ]
  [  8.32269287  -8.4524231  -10.12394714]
  [  6.39775085  -6.47781372  -7.074646  ]
  [  0.           0.           0.        ]
  [ -5.05448151   5.08499146  -5.42060089]
  [ -3.5579443    3.58856964  -3.909235  ]
  [ -3.24484062   3.00595665  -3.02586365]
  [ -4.54496384   4.41510773  -7.90267181]
  [ -7.25372314   7.07107544 -11.00234985]
  [  4.61853027  -4.69714355  -6.2722168 ]
  [ -4.49057007   4.47177124  -4.6456604 ]
  [ -3.77206802   3.68643188  -4.17170334]
  [ -3.68796921   3.64676285  -3.74790573]
  [ -6.08796692   5.90591049 -10.37721634]
  [ -0.74313879   0.73117495  -1.14116669]
  [ -9.19396973   9.21398926 -15.23486328]
  [  6.42531967  -6.49116707  -8.14428329]
  [ -5.39413643   5.56509018  -9.39154434]
  [ -7.76382446   8.20882416 -12.68722534]
  [  0.           0.           0.        ]
  [ -4.36952209  -3.31585693   3.15795898]
  [ -2.0806694   -1.78788662   1.93880558]
  [ -6.69223022   6.86398315  -7.12774658]
  [ -8.8374939    7.79048157  -9.8401413 ]
  [-11.71128082  12.10197449 -15.44669342]
  [-13.67028809  14.2027359  -18.90808868]
  [  0.           0.           0.        ]]]
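
For reference, dumps like the ones above can be produced by declaring the intermediate tensors as extra model outputs and printing them from predict(). A rough sketch; the model path is a placeholder and the input names/shapes follow the spec shown later in the thread:

import numpy as np
import coremltools as ct

# assumes "slice_xywh_output" and "multiply_obj_conf_output" were declared as
# model outputs so that predict() exposes them
model = ct.models.MLModel("yolov5_decode.mlmodel")  # placeholder path
dummy_inputs = {
    "raw_results1": np.random.rand(1, 3, 18, 18, 8).astype(np.float32),
    "raw_results2": np.random.rand(1, 3, 36, 36, 8).astype(np.float32),
    "raw_results3": np.random.rand(1, 3, 72, 72, 8).astype(np.float32),
}
out = model.predict(dummy_inputs, useCPUOnly=True)
for key in ("slice_xywh_output", "multiply_obj_conf_output"):
    print("Shape of key", key)
    print(out[key].shape)
    print("Data of key", key)
    print(out[key])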

However, sometimes, feeding these two tensors into an NMS layer like the following

builder.add_nms(
    name="nms",
    input_names=["slice_xywh_output", "multiply_obj_conf_output"],
    output_names=["raw_coordinates", "raw_confidence", "raw_indices", "num_boxes"],
    iou_threshold=0.5,
    score_threshold=0.4,
    max_boxes=15,
    per_class_suppression=True
)

triggers a segmentation fault and the result looks wrong:

Shape of key raw_confidence
(1, 15, 3)
Data of key raw_confidence
[[[-6.70858765  6.36968231 -8.99402618]
  [-6.70858765  6.36968231 -8.99402618]
  [-6.70858765  6.36968231 -8.99402618]
  [-6.70858765  6.36968231 -8.99402618]
  [-6.70858765  6.36968231 -8.99402618]
  [-6.70858765  6.36968231 -8.99402618]
  [-6.70858765  6.36968231 -8.99402618]
  [-6.70858765  6.36968231 -8.99402618]
  [-6.70858765  6.36968231 -8.99402618]
  [-6.70858765  6.36968231 -8.99402618]
  [-6.70858765  6.36968231 -8.99402618]
  [-6.70858765  6.36968231 -8.99402618]
  [-6.70858765  6.36968231 -8.99402618]
  [-6.70858765  6.36968231 -8.99402618]
  [ 0.          0.          0.        ]]]
Shape of key raw_coordinates
(1, 15, 4)
Data of key raw_coordinates
[[[-2.84179688  0.07049561 -0.58007812 -0.61083984]
  [-2.84179688  0.07049561 -0.58007812 -0.61083984]
  [-2.84179688  0.07049561 -0.58007812 -0.61083984]
  [-2.84179688  0.07049561 -0.58007812 -0.61083984]
  [-2.84179688  0.07049561 -0.58007812 -0.61083984]
  [-2.84179688  0.07049561 -0.58007812 -0.61083984]
  [-2.84179688  0.07049561 -0.58007812 -0.61083984]
  [-2.84179688  0.07049561 -0.58007812 -0.61083984]
  [-2.84179688  0.07049561 -0.58007812 -0.61083984]
  [-2.84179688  0.07049561 -0.58007812 -0.61083984]
  [-2.84179688  0.07049561 -0.58007812 -0.61083984]
  [-2.84179688  0.07049561 -0.58007812 -0.61083984]
  [-2.84179688  0.07049561 -0.58007812 -0.61083984]
  [-2.84179688  0.07049561 -0.58007812 -0.61083984]
  [ 0.          0.          0.          0.        ]]]
zsh: segmentation fault  python3 predict.py

This has been tested on a 2020 MacBook Pro (16 GB RAM, 2.0 GHz processor).

Config

# Name                    Version                   Build  Channel
coremltools               4.0b1                    pypi_0    pypi

Any thoughts on this?

Thanks

@dlawrences added the bug (Unexpected behaviour that should be corrected) label on Jul 6, 2020
@anilkatti

@dlawrences thanks for reporting this. We are looking into this.

@bhushan23
Contributor

@dlawrences can you share the new mlmodel with the NMS layer?

@dlawrences
Author

@dlawrences can you share the new mlmodel with the NMS layer?

Sure. Could you please provide a private e-mail address so I can send it over? I wouldn't be able to share it publicly for now, though.

Cheers

@moto-apple
Collaborator

Sure. Could you please provide a private e-mail address so I can send it over? I wouldn't be able to share it publicly for now, though.

Hello @dlawrences, could you file a radar and attach the model there? Thank you.
https://developer.apple.com/bug-reporting/

@dlawrences
Author

Hi

I'll try to file the radar in the coming days.

In terms of the code, this is what I am doing:

builder.add_nms(
    name="nms",
    input_names=["raw_coordinates", "raw_confidence"],
    output_names=["coordinates", "confidence", "indices", "num_boxes"],
    iou_threshold=0.5,
    score_threshold=0.0,
    max_boxes=10,
    per_class_suppression=True
)

With the settings above, I am actually getting the following error:

objc[12134]: Attempted to unregister unknown __weak variable at 0x7f83c98913d0. This is probably incorrect use of objc_storeWeak() and objc_loadWeak(). Break on objc_weak_error to debug.

python3(12134,0x10efc8dc0) malloc: *** error for object 0x7f83c98912b0: pointer being freed was not allocated
python3(12134,0x10efc8dc0) malloc: *** set a breakpoint in malloc_error_break to debug

The description of the spec is the following:

input {
  name: "raw_results1"
  type {
    multiArrayType {
      shape: 1
      shape: 3
      shape: 18
      shape: 18
      shape: 8
      dataType: FLOAT32
    }
  }
}
input {
  name: "raw_results2"
  type {
    multiArrayType {
      shape: 1
      shape: 3
      shape: 36
      shape: 36
      shape: 8
      dataType: FLOAT32
    }
  }
}
input {
  name: "raw_results3"
  type {
    multiArrayType {
      shape: 1
      shape: 3
      shape: 72
      shape: 72
      shape: 8
      dataType: FLOAT32
    }
  }
}
output {
  name: "coordinates"
  type {
    multiArrayType {
      dataType: DOUBLE
    }
  }
}
output {
  name: "confidence"
  type {
    multiArrayType {
      dataType: DOUBLE
    }
  }
}
output {
  name: "indices"
  type {
    multiArrayType {
      dataType: DOUBLE
    }
  }
}
output {
  name: "num_boxes"
  type {
    multiArrayType {
      dataType: DOUBLE
    }
  }
}
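
(For reference, a spec description like the one above can be printed roughly as follows; the variable name is a placeholder for the loaded coremltools.models.MLModel instance.)

spec = mlmodel.get_spec()
print(spec.description)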

The issue is that sometimes I actually get a segmentation fault, as in the original post.

It is really weird and inconsistent from one run to another.

I tried passing the data as both np.float32 and np.double, with no real difference.

Any thoughts?

Thanks

@dlawrences
Author

dlawrences commented Jul 14, 2020

One thing that's missing from the documentation: I am pretty sure the NonMaximumSuppressionLayerParams layer expects its first input (the box coordinates) in normalized xywh format, which is not really specified:

https://apple.github.io/coremltools/coremlspecification/sections/NeuralNetwork.html#nonmaximumsuppressionlayerparams
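
If that is the case, normalizing the box coordinates before NMS comes down to a scalar multiply by 1/image_size. A minimal sketch, assuming a square 640x640 network input; the layer and tensor names are illustrative, not from the actual model:

IMG_SIZE = 640.0  # assumed square network input size

# divide the xywh tensor by the input resolution so the boxes land in [0, 1]
builder.add_elementwise(
    name="normalize_xywh",
    input_names=["slice_xywh_output"],
    output_name="normalized_xywh_output",
    mode="MULTIPLY",
    alpha=1.0 / IMG_SIZE,
)

# "normalized_xywh_output" would then replace "slice_xywh_output" as the first
# input to add_nms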

@dlawrences
Author

As described above, the NMS layer is the last step in the pipeline, and everything else is working as expected. For additional debugging purposes, I am providing the actual data that enters the NMS layer.

The attached zip archive contains two files:

  • raw_coordinates.npy: binary file containing the data in the raw_coordinates tensor that is served as input to the NMS layer
  • raw_confidence.npy: binary file containing the data in the raw_confidence tensor that is served as input to the NMS layer

inputs.zip

@moto-apple
Collaborator

Hi @dlawrences, thank you for the additional information. Unfortunately I have not been able to recreate the issue. We would like to have a radar with these attachments:

  • A sysdiagnose (*) taken after the problem occurs.
  • A reproducible script or code.
  • The model file.

(*) You can capture a sysdiagnose with the following command in the terminal. It creates a file named something like sysdiagnose_2020.07.14_14-17-36-0700_macOS_MacBookPro16-1_xxxxxx.tar.gz. Please attach it to the radar.

$ sudo sysdiagnose

@pocketpixels
Contributor

pocketpixels commented Jul 3, 2021

@dlawrences I ran into this as well. The issue in my case (and very likely in your case) was that YOLO does not use a softmax but predicts probabilities for all the classes independently. This means that the probabilities for multiple classes can sum to more than 1. CoreML's NMS expects the probabilities to sum to at most 1. When the sum exceeds 1, you get a crash (in my case when running on the Neural Engine on iOS).

PS edit: In what I wrote above I was maybe a bit overly confident that this is the same issue you are seeing. It very well might not be. But try limiting the predicted confidence values to sum to <= 1 before feeding them into the NMS layer and see if it fixes the crashes for you.
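
One way to enforce that before the scores reach NMS is to rescale each box's class scores whenever they sum to more than 1. A NumPy sketch (operating on the raw score array, not inside the CoreML graph; "scores" is assumed to be the (1, N, num_classes) confidence tensor):

import numpy as np

def cap_class_scores(scores):
    # divide each box's class scores by their sum whenever that sum exceeds 1,
    # so the values fed to NMS never total more than 1
    totals = scores.sum(axis=-1, keepdims=True)
    return scores / np.maximum(totals, 1.0)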

@dlawrences
Author

dlawrences commented Jul 8, 2021

Thanks @pocketpixels. I moved away from this and ended up using a pipeline-type model, where the CNN is just one model in the pipeline feeding its results into another "model" based on the NMS implementation from CoreML. That seems to work as expected.

For reference, here's an example: https://github.com/hollance/coreml-survival-guide/blob/4dfcbb97c065726a3da240c55d90b2075959801d/MobileNetV2%2BSSDLite/ssdlite.py#L333
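
A rough sketch of that pipeline wiring with coremltools (the model paths, feature names and shapes are placeholders; the linked ssdlite.py shows the complete version, including patching the pipeline's output descriptions to match the NMS model's outputs):

import coremltools
from coremltools.models import datatypes
from coremltools.models.pipeline import Pipeline

# placeholder sub-models: the decoded CNN and a standalone NMS model
cnn_model = coremltools.models.MLModel("yolov5_decoded.mlmodel")
nms_model = coremltools.models.MLModel("yolov5_nms.mlmodel")

pipeline = Pipeline(
    input_features=[("image", datatypes.Array(3, 640, 640))],  # placeholder input
    output_features=["coordinates", "confidence"],
)
pipeline.add_model(cnn_model)
pipeline.add_model(nms_model)

final_model = coremltools.models.MLModel(pipeline.spec)
final_model.save("yolov5_pipeline.mlmodel")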

@pocketpixels
Contributor

@dlawrences Actually, that's how I am using the CoreML NMS as well, as a model in a pipeline. (Sorry, I clearly didn't read your original description closely enough.)
For me, it only crashed in the NMS (with a completely non-descriptive error) when it was running on the Neural Engine (and when the probabilities summed to > 1). When running on GPU/CPU it ran fine and would just return a bounding box confidence > 1.
I would still recommend changing your model to normalize the class probabilities so their sum cannot exceed 1, to avoid surprise mystery crashes in the future.

@gomer-noah

@dlawrences Any chance you'd still be willing to share the mlmodel you put together to decode YOLOv5 output? I'm trying to do the same using the builder but running into an "Error computing NN outputs" I can't get past. I'm sure it's something I'm doing wrong but I haven't been able to figure it out.

@pocketpixels
Contributor

@gomer-noah You can try my CoreML export script
You might want to change nms.pickTop.perClass to True, depending on your application.
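
For reference, that flag lives on the NonMaximumSuppression model spec and can be flipped roughly like this ("nms_spec" is assumed to be the spec of the standalone NMS model in the pipeline):

nms = nms_spec.nonMaximumSuppression
nms.pickTop.perClass = True  # keep the top boxes per class instead of globally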

@gomer-noah

@gomer-noah You can try my CoreML export script
You might want to change nms.pickTop.perClass to True, depending on your application.

I'll take a look. Thank you so much!

@TobyRoseman
Collaborator

@dlawrences - did you ever submit the requested information to https://developer.apple.com/bug-reporting/ ? If so, what was the id value you received?

@TobyRoseman added the awaiting response (Please respond to this issue to provide further clarification) label on Oct 13, 2021
@dlawrences
Author

@TobyRoseman nope, I never went ahead with the radar report to Apple.
