I am working on a Qt project that requires me to work with a MATLAB C++ shared library. I acquire images and need to process them further later on.
It is essential that I acquire the images on the C/C++ side and then call MATLAB for processing whenever needed. The images arrive at high speed: about 100 frames per second.
The problem is that when I call MATLAB in a loop, I can process the acquired images, but not in real time. One or two seconds pass between subsequent calls to MATLAB. I assume it is dropping most of the images and plotting only some of them.
Can you suggest a way to call the MATLAB function just once and have my inputs change in real time? I don't intend to use MATLAB Engine, because that would require MATLAB to be installed on every computer my project runs on.
Are you creating a library from MATLAB code using MATLAB Compiler, and expecting to be able to call it 100 times per second?
That's not going to happen - the overhead of calling the library is too high. It sounds like your library might also be doing some plotting, which is likely to take too long as well.
You could perhaps look into using MATLAB Coder to convert your MATLAB image processing algorithm to C code, and then integrate the C code directly into your main code. Much of Image Processing Toolbox is supported by MATLAB Coder, as are Computer Vision System Toolbox and many of the signal-processing-related toolboxes.
[Tensorflow (TF) on CPU]
I am using the skeleton code provided for C++ TF inference from GitHub [label_image/main.cc] in order to run a frozen model I have created in Python. This model is an FC NN with two hidden layers.
In my current project's C++ code, I run the NN's frozen classifier for each single image (8x8 pixels). For each sample, a Session->Run call takes about 0.02 seconds, which is expensive in my application, since I can have 64000 samples that I have to run.
When I send a batch of 1560 samples, the Session->Run call takes about 0.03 seconds.
Are these time measurements normal for the Session->Run call? From the C++ end, should I send my frozen model batches of images rather than single samples? From the Python end, are there optimisation tricks to alleviate that bottleneck? Is there a way to make Session->Run calls concurrently in C++?
Environment info
Operating System: Linux
Installed version of CUDA and cuDNN: N/A
What other attempted solutions have you tried?
I installed TF built with the optimised instruction set for my CPU, but it does not seem to give me the huge time saving mentioned on Stack Overflow.
Unified the session for the Graph I created.
EDIT
It seems that MatMul is the bottleneck -- Any suggestions how to improve that?
Should I use 'optimize_for_inference.py' script for my frozen graph?
How can you measure the time in Python with high precision?
Timeline for feeding an 8x8 sample and getting the result in Python
Timeline for feeding an 8x8 batch and getting the result in Python
For the record, I have done two things that significantly increased the speed of my application:
Compiled TF to work on the optimised ISA of my machine.
Applied batching to my data samples.
Please feel free to comment here if you have questions about my answer.
I am doing some work with OpenCV on Android. My code was all on the Java interface and its performance was very bad, so I implemented almost all the image processing with JNI. So on Java I have:
@Override
public Mat onCameraFrame(CameraBridgeViewBase.CvCameraViewFrame inputFrame)
{
    // Grab the RGBA frame and hand its native pointer to the C++ side.
    _imgBgrTemp = inputFrame.rgba();
    jniFrame(_imgBgrTemp.nativeObj);
    return _imgBgrTemp;
}
So the jniFrame() function takes care of all the image processing in C++, and this gave a good improvement in performance. But it still runs at around 8 fps, while my camera can do 30 fps without any processing on the image.
Looking more closely, I saw that even while processing, the code uses only 25% of the CPU on my Android device, which is a Zenfone 2 with a quad-core processor.
I am thinking about running it in 4 threads, so I would have a FIFO pool that receives frames, processes them, and displays them in the order received.
I am thinking of using this:
Creating a Manager for Multiple Threads
So my questions are:
Am I going the right way?
Do you have any tips?
What should I consider (as I am working with JNI)?
I didn't post jniFrame() here because I don't think it is very relevant: it is work-in-progress code and very big. It is about recognizing a Rubik's cube and getting its colors. If you can also give me any tips on that, please do, but I may create another question about that part later.
An update:
I was just reading about using OpenCL, but it seems even more complicated than multi-threading, and I am not sure the improvement would be worth it. Would it be better than multi-threading?
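The FIFO pool described in the question can be sketched roughly as follows. This is a minimal illustration in plain C++11, not the poster's code: `Frame` is a stand-in for a real frame type such as `cv::Mat`, and `process_in_order` is a hypothetical helper name. Each frame gets a sequence number when it enters the queue, the workers process frames concurrently, and the consumer releases results strictly by sequence number so the display order matches the capture order.

```cpp
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <map>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Stand-in for a camera frame (assumption); a real pipeline would use cv::Mat.
using Frame = int;

// Processes every frame on a pool of worker threads and returns the results
// in the order the frames were received.
std::vector<Frame> process_in_order(const std::vector<Frame>& frames,
                                    unsigned num_workers,
                                    const std::function<Frame(Frame)>& process) {
    if (num_workers == 0) num_workers = 1;
    std::queue<std::pair<std::size_t, Frame>> fifo;  // (sequence number, frame)
    std::map<std::size_t, Frame> done;               // finished frames by sequence
    std::mutex m;
    std::condition_variable work_ready, result_ready;
    bool closed = false;

    auto worker = [&] {
        for (;;) {
            std::pair<std::size_t, Frame> job;
            {
                std::unique_lock<std::mutex> lock(m);
                work_ready.wait(lock, [&] { return closed || !fifo.empty(); });
                if (fifo.empty()) return;  // queue closed and drained
                job = fifo.front();
                fifo.pop();
            }
            Frame out = process(job.second);  // heavy work runs outside the lock
            {
                std::lock_guard<std::mutex> lock(m);
                done.emplace(job.first, out);
            }
            result_ready.notify_one();
        }
    };

    std::vector<std::thread> pool;
    for (unsigned i = 0; i < num_workers; ++i) pool.emplace_back(worker);

    // Producer side: in a real app the camera callback would push here.
    for (std::size_t seq = 0; seq < frames.size(); ++seq) {
        {
            std::lock_guard<std::mutex> lock(m);
            fifo.emplace(seq, frames[seq]);
        }
        work_ready.notify_one();
    }
    {
        std::lock_guard<std::mutex> lock(m);
        closed = true;
    }
    work_ready.notify_all();

    // Consumer side: wait for the next sequence number before releasing it.
    std::vector<Frame> ordered;
    for (std::size_t next = 0; next < frames.size(); ++next) {
        std::unique_lock<std::mutex> lock(m);
        result_ready.wait(lock, [&] { return done.count(next) != 0; });
        ordered.push_back(done[next]);
        done.erase(next);
    }
    for (auto& t : pool) t.join();
    return ordered;
}
```

Calling `process_in_order(frames, 4, f)` would match the four-thread idea. Note that this raises throughput but not per-frame latency, and with JNI each worker would need its own native buffers rather than sharing one `Mat`.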
Following on this thread I reimplemented my image processing code to send in 10 images at once (i.e. I now have the num property of the input blob set to 100 instead of 10).
However, the time required to process this batch is 10 times longer than before, which means that I did not get any performance increase.
Is that reasonable, or did I do something wrong?
I am running Caffe in CPU mode. Unfortunately GPU mode is not an option for me.
Update: Caffe now natively supports parallel processing of multiple images when using multiple GPUs. Although it seems relatively simple to implement based on the current implementation of GPU parallelism, at the moment there is no similar support for parallel processing on multiple CPUs.
The main problem with implementing parallelism is the syncing you need during training. If you just want to process your images in parallel (as opposed to training the model), you could load several copies of the same network into memory (through Python with multiprocessing or C++ with multi-threading) and process each image on a different network. It would be simple and quite effective, especially if you load the networks once and then process a large number of images. Nevertheless, GPUs are much faster :)
Caffe doesn't process multiple images in parallel. The only saving you get from batching several images is in the time it takes to transfer the image data to and from Caffe's framework, which can be significant when dealing with the GPU.
IIRC there have been several attempts to make Caffe process images in parallel, but most focus on the GPU implementation (CUDNN, CUDA Streams etc.), with few attempts to add parallelism to the CPU code (OpenBLAS's multithread mode, or simply running on multiple threads). Of those, I believe only the CUDNN option is currently part of the stable version of Caffe, and it obviously requires a GPU. You can look at one of the pull requests on this matter on Caffe's GitHub page and see if it works for you, but note that it might cause compatibility issues with your current version.
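The "several copies of the same network" idea can be sketched like this. It is only an illustration of the threading pattern: `Net` is a hypothetical stand-in with a trivial `forward`, not Caffe's API, and `classify_parallel` is an invented name. Each thread makes its own private copy of the network once and then handles every `num_threads`-th image, so no network state is shared between threads.

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for a loaded network (assumption, not Caffe's API).
struct Net {
    float weight;
    float forward(float input) const { return weight * input; }
};

// Runs inference over all images using one private network copy per thread.
std::vector<float> classify_parallel(const Net& prototype,
                                     const std::vector<float>& images,
                                     unsigned num_threads) {
    if (num_threads == 0) num_threads = 1;
    std::vector<float> outputs(images.size());
    std::vector<std::thread> threads;
    for (unsigned t = 0; t < num_threads; ++t) {
        threads.emplace_back([&, t] {
            Net local = prototype;  // loaded once per thread, then reused
            // Thread t handles images t, t + num_threads, t + 2*num_threads, ...
            for (std::size_t i = t; i < images.size(); i += num_threads) {
                outputs[i] = local.forward(images[i]);  // each slot written by one thread
            }
        });
    }
    for (auto& th : threads) th.join();
    return outputs;
}
```

The memory cost grows with the number of copies, so this fits best when the model is small compared with the amount of data.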
This is one such version that in the past I've used, though it's no longer maintained: https://github.com/BVLC/caffe/pull/439
I've also noticed in the last comment of the above issue that there's some speed up to the CPU code on this pull request as well, though I've never tried it myself: https://github.com/BVLC/caffe/pull/2610
I am trying to understand interprocess communication in CUDA. I would like some help understanding this concept and applying it to a project I am working on.
I have an image acquisition system that provides N input images. Each raw input image is first processed and then stored in a single variable called 'Result'. Four functions do the processing of the image: Aprocess, Bprocess, Cprocess and Dprocess. Each time the system acquires a new image, these four functions are called to process it. The final image 'Result' is stored in Dprocess.
What I would like to do is:
Create a new process, 'process2', where I can hand off one (final) image stored in 'Result', each time that image is obtained, and put it in a buffer called 'Images'. I would like to do this for 10 images. 'process2' should wait for a new image to be passed to it and not terminate because the first process has to keep calling the four functions and get a final processed image.
What I have come across so far: cudaIpcGetMemHandle, cudaIpcOpenMemHandle and cudaIpcCloseMemHandle
Question: How do I use the above function names to achieve IPC?
The CUDA simpleIPC sample code demonstrates that.
There is also a brief mention of how to use the CUDA IPC API in the programming guide.
Finally, the API itself is documented in the runtime API reference manual.
Note that this functionality requires compute capability (cc) 2.0 or higher and a 64-bit Linux OS.
I've been working on a face-tracking system for the last couple of months, and now I need to make everything run in parallel to increase performance.
The main cpp file is:
int _tmain(int argc, _TCHAR* argv[])
{
cFrame.initCamFrames(20, 1600, 1200, 3); //INITIATES BUFFER FOR CAM FRAMES, 20 frames, res:1600x1200, 3bytes per pixel.
eyeTracking.initTrackingSystem(&cFrame); //INITIATES EYETRACKING SOFTWARE WITH POINTER TO THE BUFFER WHERE EYETRACKINGSOFTWARE GETS THE FRAMES TO SEARCH WITHIN. (opencv)
directShow directShowClass;
directShowClass.initiateDirectShow(false, &cFrame); //INITIATES DIRECTSHOW WITH POINTER TO BUFFER WHERE IT SHOULD SAVE FRAMES FROM CAM
directShowClass.runDirectShow(); //START CAPTURING FRAMES INTO BUFFER
eyeTracking.runTrackingSystem(); //START SEARCH FOR FACE AND EYES.
system("pause");
directShowClass.stopDirectShow();
}
I want "directShowClass.runDirectShow();" and "eyeTracking.runTrackingSystem();" to run truly in parallel. Right now I think they run as threads in pseudo-parallel (a simple printf in each method comes out mixed up in the terminal).
I guess that making a program run in parallel is not as simple as I would like it to be. But I guess that it is possible :D
Please give me some advice on where to start looking for information about how to parallelize.
I have a dual core processor.
Thanks!
Output from printf calls made in different threads can interleave, which is the mix-up you encountered. You can run the work in pseudo-parallel (for example, alternating each call between the processing steps) or with real hardware concurrency (std::thread, pthreads, Windows threads, boost::thread).
If you have a dual-core processor, you can certainly take advantage of multi-core processing. I would suggest using Boost.
Just to be clear, by using threads you do get real parallelism. But remember that your computer's cores are also running other background processes that take up CPU time, so your functions are not being executed all the time.
In order to get some parallelism in C++ you have many options. I name three:
1. The oldest and most common way is to use the pthread library, which is available on almost every POSIX platform.
2. The C++11 standard includes a native threading library (std::thread). Support may be incomplete in older compilers, and you may need to enable the standard explicitly. For example:
g++ -std=c++11 -pthread
3. Finally, if you are in the mood for some "higher level" stuff, you can put some effort into learning the OpenMP framework, which uses pragma directives to annotate parallel tasks. The framework then deals with all the thread creation, so you can spend your time on other things.
P.S.: The reason the output comes out mixed is not that the threads run in pseudo-parallel, but that they are writing to the same buffer concurrently. So when the buffer is dumped, you see it as the threads wrote it. If anything, this is evidence that they are actually running in parallel, but you are making them write their output to the same buffer ;)
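As a rough illustration of both points (real parallelism with threads, and output that stays readable), here is a minimal C++11 sketch. The names are assumptions: `capture_loop` and `tracking_loop` are hypothetical stand-ins for runDirectShow() and runTrackingSystem(), and `Log` is an invented helper. The two loops run on separate threads, and a mutex makes each log line go into the shared buffer as a whole.

```cpp
#include <functional>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// Collects whole log lines. The mutex keeps lines written by different
// threads from being mixed together inside the shared buffer.
class Log {
public:
    void write(const std::string& line) {
        std::lock_guard<std::mutex> lock(mutex_);
        lines_.push_back(line);
    }
    std::vector<std::string> lines() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return lines_;
    }
private:
    mutable std::mutex mutex_;
    std::vector<std::string> lines_;
};

// Hypothetical stand-in for the capture side (runDirectShow in the question).
void capture_loop(Log& log, int iterations) {
    for (int i = 0; i < iterations; ++i) log.write("capture " + std::to_string(i));
}

// Hypothetical stand-in for the tracking side (runTrackingSystem).
void tracking_loop(Log& log, int iterations) {
    for (int i = 0; i < iterations; ++i) log.write("track " + std::to_string(i));
}

// Runs both loops concurrently and returns every line that was logged.
std::vector<std::string> run_both(int iterations) {
    Log log;
    std::thread capture(capture_loop, std::ref(log), iterations);
    std::thread tracking(tracking_loop, std::ref(log), iterations);
    capture.join();   // wait for both threads before reading the log
    tracking.join();
    return log.lines();
}
```

The order of lines from the two threads still varies from run to run; the lock only guarantees that each line is written intact.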