Can an ONNX network be incompatible with onnxruntime? - c++

I am having trouble running inference on an ONNX model, either by making (tiny) adjustments to this Windows ML tutorial, or by implementing my own ONNX Runtime code following their MNIST Tutorial. As I understand it, Windows ML makes use of ONNX Runtime, so both efforts probably end up in the same place... and probably generate the same underlying exception for the same reason.
The exceptions thrown are either unintelligible (a second exception thrown during exception handling by the looks...) or not identified at all. It makes me wonder if the network itself is faulty or incompatible in some sense. The network was produced by taking a saved Tensorflow/Keras model and running this conversion:
python -m tf2onnx.convert --saved-model MyNet --output MyNet.onnx --inputs-as-nchw mobilenetv2_1_00_224_input:0
The result is a network that is rendered by Netron with the following input & output stages:
Is there anything about this network that is obviously incompatible with ONNX Runtime? Any suggestions on how to push past either/both of these exceptions?

Turns out that in my attempt to adapt the Windows ML example, I had the output shape wrong - in that example, the output shape is 1 x 1000 x 1 x 1. I had copied/pasted this and just modified the 1000 to suit. Clearly the network above needs a 1 x 10 shape.
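For anyone else adapting that tutorial: the output shape can also be queried from the session instead of hard-coded. A minimal sketch using the ONNX Runtime C++ API (the model path is just the converted file from above; error handling omitted):

#include <onnxruntime_cxx_api.h>
#include <iostream>
#include <vector>

int main()
{
    Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "shape-check");
    Ort::SessionOptions options;
    Ort::Session session(env, L"MyNet.onnx", options);   // ORTCHAR_T is wchar_t on Windows

    // Ask the session for the first output's shape instead of assuming 1 x 1000 x 1 x 1.
    Ort::TypeInfo type_info = session.GetOutputTypeInfo(0);
    auto tensor_info = type_info.GetTensorTypeAndShapeInfo();
    std::vector<int64_t> shape = tensor_info.GetShape();  // e.g. {1, 10} for this network

    for (int64_t dim : shape)
        std::cout << dim << " ";
    std::cout << std::endl;
    return 0;
}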

Related

Demo Code for Detectron Not Detecting Object Instances

I am trying to get the demo code for Detectron2 working locally on my laptop. Everything appears to run correctly, but no object instances are detected, even when I use the image from the Colab demo.
I am running on a non-GPU Mac. I followed the installation instructions to install Detectron. I have the following module versions on my machine:
detectron2#git+https://github.com/facebookresearch/detectron2.git#ea3b3f22bf1de58008599794f149149ff65d3780
opencv-python==4.5.3.56
torch==1.9.0
torchvision==0.10.0
I copied demo.py, predictor.py, mask_rcnn_R_101_FPN_3x.yaml, and Base-RCNN-FPN.yaml from Detectron's GitHub. I then ran the "inference demo with pretrained model" command. The specific command was this:
python demo.py --input 000000439715.jpeg --output output --config-file mask_rcnn_R_101_FPN_3x.yaml --opts MODEL.WEIGHTS detectron2://COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x/137849600/model_final_f10217.pkl MODEL.DEVICE cpu
000000439715.jpeg is the sample image of the man on horseback from the Colab notebook demo. The last line of the output is
000000439715.jpeg: detected 0 instances in 6.77s
The image in the output directory has no annotation on it.
The logging output looks okay to me. The only thing that may be an indication of a problem is a warning at the top
[08/28 12:35:18 detectron2]: Arguments: Namespace(confidence_threshold=0.5, config_file='mask_rcnn_R_101_FPN_3x.yaml', input=['000000439715.jpeg'], opts=['MODEL.WEIGHTS', 'detectron2://COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x/137849600/model_final_f10217.pkl', 'MODEL.DEVICE', 'cpu'], output='output', video_input=None, webcam=False)
[08/28 12:35:18 fvcore.common.checkpoint]: [Checkpointer] Loading from detectron2://COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x/137849600/model_final_f10217.pkl ...
[08/28 12:35:18 fvcore.common.checkpoint]: Reading a file from 'Detectron2 Model Zoo'
WARNING [08/28 12:35:19 fvcore.common.checkpoint]: Some model parameters or buffers are not found in the checkpoint:
I'm not sure what to do about it though.
I tried not specifying the model weights. I also tried setting the confidence threshold to zero. I got the same results.
Am I doing something wrong? What are the next debugging steps?
I ran into the same problem, with the same warning:
WARNING [xxxxxxxxx fvcore.common.checkpoint]: Some model parameters or buffers are not found in the checkpoint:
and this warning made my results very bad. Finally, I found that I was using the wrong weight file.
Hope this can help you.

ArrayFire convolution issue with Cuda backend

I've been having an issue with a certain function call:
dphaseWeighted = af::convolve(dphaseWeighted, m_slowTimeFilter);
which seems to produce nothing but NaNs.
The background is that we have recently switched from using AF OpenCL to AF CUDA, and the problem that we are seeing happens in this call:
dphaseWeighted = af::convolve(dphaseWeighted, m_slowTimeFilter);
This seems to work well when using OpenCL.
Unfortunately, I can't give you the whole function because of IP, only a couple of snippets.
This convolve lies deep within a phase-extraction piece of code, and is actually the second part of that code which uses the af::convolve function.
The first call seems to behave as expected, with sensible floating-point data out,
but when it comes to the second call, all I'm seeing is NaNs coming out (I view that with af_print and by dumping the data to a file).
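For concreteness, a check like this around the same call (just a sketch; both variables are af::array as above, and printf needs <cstdio>) shows whether the NaNs are already present in the inputs or only appear in the output:

// Count NaNs before and after the call to see where they first appear.
unsigned nanInSignal = af::count<unsigned>(af::isNaN(dphaseWeighted));
unsigned nanInFilter = af::count<unsigned>(af::isNaN(m_slowTimeFilter));

dphaseWeighted = af::convolve(dphaseWeighted, m_slowTimeFilter);

unsigned nanInResult = af::count<unsigned>(af::isNaN(dphaseWeighted));
printf("NaNs: signal=%u  filter=%u  result=%u\n", nanInSignal, nanInFilter, nanInResult);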
In the CMakeLists.txt I include
include_directories(${ArrayFire_INCLUDE_DIRS})
and
target_link_libraries(DASPhaseInternalLib ${ArrayFire_CUDA_LIBRARIES})
and it builds as expected.
Has anyone experienced anything like this before?

Assertion sv_count !=0 failed - Function train_auto, SVM type - EPS_SVR

The question is related to the OpenCV library, version 2.4.13.2.
I am using n-dimensional feature vectors from images for training and performing regression. The output values range between 0 and 255.
The function CvSVM::train works without an error, but requires manually setting the parameters. So I would prefer to use the function CvSVM::train_auto to perform cross-validation and determine the best parameters for the situation.
But I am facing the error:
OpenCV Error: Assertion failed (sv_count != 0) in CvSVM::do_train.
On changing the type to NU_SVR, it works well. The problem is only with type EPS_SVR.
I would appreciate any help I could receive to fix this.
EDIT: I was able to pinpoint the problem to line number 1786 in the file
opencv-master\sources\modules\ml\src\svm.cpp
FOR_IN_GRID(p, p_grid)
Upon commenting it out, the code runs without errors. I am unaware of the possible reasons.
I am facing the same bug. I found out that this bug was caused by svm.setP(x) and svm.setTermCriteria((cv2.TERM_CRITERIA_EPS, y)) where the x and y values were more than 0.1 (10^-1).
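Building on that: since the assertion comes from the grid search over p, one workaround in the 2.4 C++ API is to fix p manually and disable its grid (a step of 1 or less tells train_auto not to iterate over that parameter). A rough sketch, with trainData/responses standing in for your own feature matrix and target values:

#include <opencv2/core/core.hpp>
#include <opencv2/ml/ml.hpp>

// Sketch for OpenCV 2.4: fix p manually and disable the p grid so train_auto
// never enters FOR_IN_GRID(p, p_grid). trainData is CV_32FC1, one row per
// sample; responses holds the target values (0-255 here).
void trainEpsSvr(const cv::Mat& trainData, const cv::Mat& responses)
{
    CvSVMParams params;
    params.svm_type    = CvSVM::EPS_SVR;
    params.kernel_type = CvSVM::RBF;
    params.p           = 0.1;  // epsilon-tube width, chosen manually
    params.C           = 1.0;
    params.gamma       = 0.1;
    params.term_crit   = cvTermCriteria(CV_TERMCRIT_ITER + CV_TERMCRIT_EPS, 1000, 1e-6);

    CvParamGrid pGrid = CvSVM::get_default_grid(CvSVM::P);
    pGrid.step = 0;  // a step <= 1 disables the grid search over this parameter

    CvSVM svm;
    svm.train_auto(trainData, responses, cv::Mat(), cv::Mat(), params, 10,
                   CvSVM::get_default_grid(CvSVM::C),
                   CvSVM::get_default_grid(CvSVM::GAMMA),
                   pGrid,
                   CvSVM::get_default_grid(CvSVM::NU),
                   CvSVM::get_default_grid(CvSVM::COEF),
                   CvSVM::get_default_grid(CvSVM::DEGREE));
}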

Running the executable of hdl_simple_viewer.cpp from Point Cloud Library

The Point Cloud library comes with an executable pcl_hdl_viewer_simple that I can run (./pcl_hdl_viewer_simple) without any extra arguments to get live data from a Velodyne LIDAR HDL32.
The source code for this program is supposed to be hdl_viewer_simple.cpp. A simplified version of the code is given on this page which cannot be compiled readily and requires a tiny bit of tweaking to make it compile.
My problem is that the executables I build myself for both versions are not able to run. I always get the smart-pointer error "Assertion px != 0". I am not sure whether I am not executing the program in the correct way or what. The executable is supposed to be executed like
./hdl_viewer_simple -calibrationFile hdl32calib.xml -pcapFile file.pcap
when playing back previously recorded PCAP files, or just ./hdl_viewer_simple when getting live data from the real sensor. However, I always get the assertion failed error.
Has anyone been able to run the executables? I do not want to use the ROS drivers.
"Assertion px!=0" is occurring because your pointer is not initialized.
That being said, you could initialize it inside your routines in case the pointer is NULL, especially for data input.
In there, you can try updating line 83 like this:
CloudConstPtr cloud(new Cloud); //initializing your pointer
and hopefully, it will work.
Cheers,

my c++ extension behaves differently with faulthandler

Background
I have a C++ extension which runs a 3D watershed pass on a buffer. It's got a nice Cython wrapper to initialise a massive buffer of signed chars to represent the voxels. I initialise some native data structures in python (in a compiled cython file) and then call one C++ function to initialise the buffer, and another to actually run the algorithm (I could have written these in Cython too, but I'd like it to work as a C++ library as well without a python.h dependency.)
Weirdness
I'm in the process of debugging my code, trying different image sizes to gauge RAM usage and speed, etc., and I've noticed something very strange about the results - they change depending on whether I use python test.py (specifically /usr/bin/python on Mac OS X 10.7.5/Lion, which is Python 2.7) or run python, import test, and call a function on it. (Indeed, on my laptop (OS X 10.6.latest, with MacPorts Python 2.7) the results are also deterministically different - each platform/situation is different, but each one is always the same as itself.) In all cases, the same function is called, loads some input data from a file, and runs the C++ module.
A note on 64-bit python - I am not using distutils to compile this code, but something akin to my answer here (i.e. with an explicit -arch x86_64 call). This shouldn't mean anything, and all my processes in Activity Monitor are called Intel (64-bit).
As you may know, the point of watershed is to find objects in the pixel soup - in 2D it's often used on photos. Here, I'm using it to find lumps in 3D in much the same way - I start with some lumps ("grains") in the image and I want to find the inverse lumps ("cells") in the space between them.
The way the results change is that I literally find a different number of lumps. For exactly the same input data:
python test.py:
grain count: 1434
seemed to have 8000000 voxels, with average value 0.8398655
find cells:
running watershed algorithm...
found 1242 cells from 1434 original grains!
...
however,
python, import test, test.run():
grain count: 1434
seemed to have 8000000 voxels, with average value 0.8398655
find cells:
running watershed algorithm...
found 927 cells from 1434 original grains!
...
This is the same in the interactive python shell and bpython, which I originally thought was to blame.
Note the "average value" number is exactly the same - this indicates that the same fraction of voxels have initially been marked as in the problem space - i.e. that my input file was initialised in (very very probably) exactly the same way both times in voxel-space.
Also note that no part of the algorithm is non-deterministic; there are no random numbers or approximations; subject to floating point error (which should be the same each time) we should be performing exactly the same computations on exactly the same numbers both times. Watershed runs using a big buffer of integers (here signed chars) and the results are counting clusters of those integers, all of which is implemented in one big C++ call.
I have tested the __file__ attribute of the relevant module objects (which are themselves attributes of the imported test), and they're pointing to the same installed watershed.so in my system's site-packages.
Questions
I don't even know where to begin debugging this.
- How is it possible to call the same function with the same input data and get different results?
- What about interactive Python might cause this (e.g. by changing the way the data is initialised)?
- Which parts of the (rather large) codebase are relevant to these questions?
In my experience it's much more useful to post ALL the code in a Stack Overflow question, and not assume you know where the problem is. However, there are thousands of lines of code here, and I have literally no idea where to start! I'm happy to post small snippets on request.
I'm also happy to hear debugging strategies - interpreter state that I can check, details about the way python might affect an imported C++ binary, and so on.
Here's the structure of the code:
project/
    clibs/
        custom_types/
            adjacency.cpp (and hpp)      << graph adjacency (2nd pass; irrelevant = irr)
            *array.c (and h)             << dynamic array of void*s
            *bit_vector.c (and h)        << int* as bitfield with helper functions
            polyhedron.cpp (and hpp)     << for voxel initialisation; convex hull result
            smallest_ints.cpp (and hpp)  << for voxel entity affiliation tracking (irr)
        custom_types.cpp (and hpp)       << wraps all files in custom_types/
        delaunay.cpp (and hpp)           << marshals calls to stripack.f90
        *stripack.f90 (and h)            << for computing the convex hulls of grains
        tensors/
            *D3Vector.cpp (and hpp)      << 3D double vector impl with operators
        watershed.cpp (and hpp)          << main algorithm entry points (ini, +two passes)
    pywat/
        __init__.py
        watershed.pyx                    << cython class, python entry points.
        geometric_graph.py               << python code for post processing (irr)
        setup.py                         << compile and install directives
    test/
        test.py                          << entry point for testing installed lib
(files marked * have been used extensively in other projects and are very well tested, those suffixed irr contain code only run after the problem has been caused.)
Details
as requested, the main stanza in test/test.py:
testfile = 'valid_filename'

if __name__ == "__main__":
    # handles segfaults...
    import faulthandler
    faulthandler.enable()
    run(testfile)
and my interactive invocation looks like:
import test
test.run(test.testfile)
Clues
when I run this at the straight interpreter:
import faulthandler
faulthandler.enable()
import test
test.run(test.testfile)
I get the results from the file invocation (i.e. 1242 cells), although when I run it in bpython, it just crashes.
This is clearly the source of the problem - hats off to Ignacio Vazquez-Abrams for asking the right question straight away.
UPDATE:
I've opened a bug on the faulthandler github and I'm working towards a solution. If I find something that people can learn from I'll post it as an answer.
After debugging this application extensively (printf()ing out all the data at multiple points during the run, piping outputs to log files, diffing the log files) I found what seemed to cause the strange behaviour.
I was using uninitialised memory in a couple of places, and (for some bizarre reason) this gave me repeatable behaviour differences between the two cases I describe above - one without faulthandler and one with.
Incidentally, this is also why the bug disappeared from one machine but continued to manifest itself on another, part way through debugging (which really should have given me a clue!)
My mistake here was to assume things about the problem based on a spurious correlation - in theory the garbage ram should have been differently random each time I accessed it (ahh, theory.) In this case I would have been quicker finding the problem with a printout of the main calculation function and a rubber duck.
So, as usual, the answer is that the bug is not in the library; it is somewhere in your code - in this case, it was my fault for malloc()ing a chunk of RAM and falsely assuming that other parts of my code were going to initialise it (which they only did sometimes).
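In code terms, the mistake and the fix look something like this (a sketch; n_voxels is a placeholder for whatever the image dimensions multiply out to):

#include <cstdlib>
#include <cstring>

// Before: malloc() hands back whatever garbage was already in that RAM, and
// only some code paths happened to overwrite it before it was read.
signed char* voxels = static_cast<signed char*>(std::malloc(n_voxels));

// Fix: initialise the buffer explicitly instead of assuming other code will.
std::memset(voxels, 0, n_voxels);
// (or allocate zeroed memory in one step: std::calloc(n_voxels, sizeof(signed char)))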