I've been working on a face-tracking system for the last couple of months, and now I need to make everything run in parallel to improve performance.
The main cpp file is:
int _tmain(int argc, _TCHAR* argv[])
{
cFrame.initCamFrames(20, 1600, 1200, 3); //INITIATES BUFFER FOR CAM FRAMES, 20 frames, res:1600x1200, 3bytes per pixel.
eyeTracking.initTrackingSystem(&cFrame); //INITIATES EYETRACKING SOFTWARE WITH POINTER TO THE BUFFER WHERE EYETRACKINGSOFTWARE GETS THE FRAMES TO SEARCH WITHIN. (opencv)
directShow directShowClass;
directShowClass.initiateDirectShow(false, &cFrame); //INITIATES DIRECTSHOW WITH POINTER TO BUFFER WHERE IT SHOULD SAVE FRAMES FROM CAM
directShowClass.runDirectShow(); //START CAPTURING FRAMES INTO BUFFER
eyeTracking.runTrackingSystem(); //START SEARCH FOR FACE AND EYES.
system("pause");
directShowClass.stopDirectShow();
}
I want "directShowClass.runDirectShow();" and "eyeTracking.runTrackingSystem();" to run truly in parallel. Right now I think they run as threads in pseudo-parallel (simple printf calls in each method come out mixed up in the terminal).
I guess that making a program run in parallel is not as simple as I would like it to be, but I guess that it is possible :D
Please give me some advice on where to start searching for information about how to parallelize.
I have a dual core processor.
Thanks!
printf calls from different threads are not coordinated with each other, so their output can get mixed up the way you saw. You can run the process in pseudo-parallel (for example, alternating between the processing steps in one loop) or with real hardware concurrency (std::thread, pthreads, Windows threads, boost::thread).
If you have a dual-core processor you can certainly take advantage of multi-core processing. I would suggest using Boost.
Just to be clear, by using threads you do get real parallelism. But remember that your computer is also running other background processes on its cores, and they take CPU time, so your functions are not always executing.
In order to get some parallelism in C++ you have many options. I name three:
. The oldest and most common way is to use the pthread library, which is available on almost every POSIX system (Linux, macOS). On Windows it requires a port, or you can use the native Win32 thread API instead.
. The new C++ standard, C++11, includes native library support for multi-threading. You can check that out, but it is still not supported by every compiler, and most compilers that do support it have only partial functionality. You also need to enable the standard explicitly. For example:
g++ -std=c++11
. Finally, if you are in the mood for some "higher level" stuff, you can put some effort in learning about the OpenMP framework, which uses pragma directives to annotate parallel tasks. The framework will then deal with all the creation of threads, so you can use your time in some other stuff.
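As a minimal sketch of the C++11 option, assuming a compiler with std::thread support: captureFrames and trackFaces below are hypothetical stand-ins for directShowClass.runDirectShow() and eyeTracking.runTrackingSystem(), and the atomic counters only stand in for real work.

```cpp
#include <atomic>
#include <thread>

// Hypothetical stand-ins for the capture and tracking loops.
std::atomic<int> framesCaptured{0};
std::atomic<int> framesTracked{0};

void captureFrames(int count) {
    for (int i = 0; i < count; ++i) ++framesCaptured;
}

void trackFaces(int count) {
    for (int i = 0; i < count; ++i) ++framesTracked;
}

// Starts both loops on separate threads and waits for both to finish.
void runInParallel(int count) {
    std::thread capture(captureFrames, count);
    std::thread tracking(trackFaces, count);
    capture.join();
    tracking.join();
}
```

Calling runInParallel(1000) from main would run both loops concurrently. With GCC on Linux this typically also needs the -pthread flag at link time.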
P.S.: The reason why the output comes out mixed is not that the threads run in pseudo-parallel, but that they are concurrently writing to the same buffer. So when the buffer is dumped you see it as the threads wrote it. If anything, this is proof that they are actually running in parallel, but you are making them write their output to the same buffer ;)
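One way to keep each line of output intact is to serialize the writes with a mutex. The sketch below assumes C++11; logLine and logFromTwoThreads are hypothetical helpers, and a string stream is used in place of the terminal so the result can be inspected.

```cpp
#include <mutex>
#include <ostream>
#include <sstream>
#include <string>
#include <thread>

std::mutex outputMutex;

// Writes one complete line while holding the lock, so text from
// different threads cannot interleave inside a line.
void logLine(std::ostream& out, const std::string& line) {
    std::lock_guard<std::mutex> lock(outputMutex);
    out << line << '\n';
}

// Two threads log to the same stream; returns everything that was written.
std::string logFromTwoThreads(int linesPerThread) {
    std::ostringstream out;
    auto worker = [&out, linesPerThread](const std::string& tag) {
        for (int i = 0; i < linesPerThread; ++i) logLine(out, tag);
    };
    std::thread a(worker, "capture");
    std::thread b(worker, "tracking");
    a.join();
    b.join();
    return out.str();
}
```

The order of the lines still depends on scheduling, but every line comes out whole.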
Related
I am doing some work with OpenCV on Android. My code was all on the Java interface and its performance was very bad, so I implemented almost all the image processing with JNI. So on Java I have:
@Override
public Mat onCameraFrame(CameraBridgeViewBase.CvCameraViewFrame inputFrame)
{
_imgBgrTemp = inputFrame.rgba();
jniFrame(_imgBgrTemp.nativeObj);
return _imgBgrTemp;
}
So the jniFrame() function takes care of all the image processing in C++, and this gave a good improvement in performance. But it still runs at around 8 fps, and my camera can do 30 fps without any processing on the image.
Looking more closely, I saw that even while processing, it uses only 25% of the CPU of my Android device, which is a Zenfone 2 with a quad-core processor.
I am thinking about running it in 4 threads, so I would have a FIFO pool to receive frames, process them, and display them in the order received.
I am thinking of using this:
Creating a Manager for Multiple Threads
So my questions are:
Am I going the right way?
Do you have any tips?
What should I consider (as I am working with JNI)?
I didn't post jniFrame() here because I don't think it is very relevant; it is work-in-progress code and very big. It is about recognizing a Rubik's cube and getting its colors. If you can also give me any tips on that, great, but I may create another question about just that part later.
An update:
I was just reading about OpenCL, but it seems even more complicated than multi-threading, and I am not sure the improvement would be good. Would it be better than multi-threading?
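The FIFO idea in the question (process several frames at once, deliver them in the order received) can be sketched on the native side roughly as follows. This assumes C++11; processFrame is a hypothetical placeholder for the real per-frame work, and frames are represented by plain integers.

```cpp
#include <cstddef>
#include <deque>
#include <future>
#include <vector>

// Hypothetical per-frame work; a real version would do the image processing.
int processFrame(int frameId) { return frameId * 2; }

// Keeps at most maxInFlight frames processing at once and collects
// results strictly in the order the frames were submitted.
std::vector<int> processInOrder(const std::vector<int>& frames,
                                std::size_t maxInFlight) {
    if (maxInFlight == 0) maxInFlight = 1; // always allow one frame in flight
    std::deque<std::future<int>> pending;
    std::vector<int> results;
    for (int frame : frames) {
        if (pending.size() == maxInFlight) {
            results.push_back(pending.front().get()); // wait for the oldest frame
            pending.pop_front();
        }
        pending.push_back(std::async(std::launch::async, processFrame, frame));
    }
    while (!pending.empty()) {
        results.push_back(pending.front().get());
        pending.pop_front();
    }
    return results;
}
```

With maxInFlight set to 4 on a quad-core device, up to four frames are processed concurrently. The cost is latency: a frame is displayed only after the frames ahead of it are done.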
Trying to understand command lists.
Well, command lists records my commands for rendering, but also binding resources, let's say a buffer with vertex data.
m_commandList->IASetVertexBuffers(0, 1, &m_vertexBufferView);
This records the binding of the vertex buffer. What happens to the buffer at this moment? What will happen if I change the contents of this vertex buffer after recording it? What will happen if I change the contents of this vertex buffer after executing the command list, while the GPU has not finished with it yet?
I guess ExecuteCommandLists is an asynchronous function call, am I right? Does it execute all the bindings (data transfers to the GPU) at once, or does it execute commands one by one, bindings included? Is the command list executed by the driver, or is it all sent to the GPU?
Well, because of the lack of good examples, I still have lots of questions. I would be happy if you could answer a few of them to make things clear.
With DirectX 12, synchronization is entirely the application's responsibility. You have to insert and check fences to make sure the GPU is done with your buffer before you modify it. For dynamic buffers, you need to do your own double/triple/n buffering.
When you call ExecuteCommandLists, it just queues the work up for the GPU. It will take some time before the GPU actually picks it up and then completes it.
Be sure to check the DirectX Graphics Samples GitHub project. The samples there and the Mini Engine demo are good places to find example usage.
This design makes DirectX 12 extremely powerful as it gives the application direct control over lots of things that were 'magic' in the Direct3D 11 Runtime. That said, the resulting API is much harder to actually use unless you are already a good enough graphics engineer that you could write the Direct3D 11 Runtime. Using Direct3D 11 API is still a fine choice for a project too.
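The fence-and-n-buffering bookkeeping described above can be modeled in plain C++ without any Direct3D calls. FrameRing and all of its members are hypothetical names for illustration only. In a real application, gpuReached would correspond to reading ID3D12Fence::GetCompletedValue, and submit would follow a call to ID3D12CommandQueue::Signal.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Plain C++ model of fence-based triple buffering; no GPU is involved.
struct FrameRing {
    static const std::size_t kFrames = 3;
    std::array<std::uint64_t, kFrames> fenceValues{}; // fence value guarding each slot
    std::uint64_t nextFence = 1;  // next value the CPU will signal
    std::uint64_t completed = 0;  // highest value the simulated GPU has reached
    std::size_t current = 0;      // slot the CPU wants to write next

    // True when the GPU has finished the work that last used the current slot.
    bool canWrite() const { return completed >= fenceValues[current]; }

    // Records that work using the current slot was submitted, then advances.
    void submit() {
        fenceValues[current] = nextFence++;
        current = (current + 1) % kFrames;
    }

    // Stands in for the GPU passing a fence value.
    void gpuReached(std::uint64_t value) {
        if (value > completed) completed = value;
    }
};
```

The CPU modifies a buffer slot only when canWrite() is true. Otherwise it has to wait for the fence, which is exactly the synchronization that Direct3D 11 used to do behind the scenes.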
Following on this thread I reimplemented my image processing code to send in 10 images at once (i.e. I now have the num property of the input blob set to 100 instead of 10).
However, the time required to process this batch is 10 times bigger than originally. Which means that I did not get any performance increase.
Is that reasonable, or did I do something wrong?
I am running Caffe in CPU mode. Unfortunately GPU mode is not an option for me.
Update: Caffe now natively supports parallel processing of multiple images when using multiple GPUs. Though it seems relatively simple to implement based on the current implementation of GPU parallelism, at the moment there is no similar support for parallel processing on multiple CPUs.
The main problem with implementing parallelism is the syncing you need during training. If you just want to process your images in parallel (as opposed to training the model), you could load several copies of the same network into memory (whether through Python with multiprocessing or C++ with multi-threading) and process each image on a different network. It would be simple and quite effective, especially if you load the networks once and then process a large number of images. Nevertheless, GPUs are much faster :)
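A rough C++ sketch of the "one network copy per thread" idea follows. Net and its forward method are hypothetical placeholders, not Caffe's API, and images are represented by plain integers.

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for a loaded network; not Caffe's API.
struct Net {
    int forward(int image) const { return image + 1; }
};

// Each worker owns its own Net and handles every workers-th image,
// so the workers never touch the same network or the same result slot.
std::vector<int> classifyAll(const std::vector<int>& images, std::size_t workers) {
    std::vector<int> results(images.size());
    std::vector<std::thread> pool;
    for (std::size_t w = 0; w < workers; ++w) {
        pool.emplace_back([&images, &results, w, workers] {
            Net net; // loaded once per worker, reused for all of its images
            for (std::size_t i = w; i < images.size(); i += workers)
                results[i] = net.forward(images[i]);
        });
    }
    for (std::thread& t : pool) t.join();
    return results;
}
```

This assumes workers is at least 1. Memory use grows with the number of copies, so the worker count is normally kept close to the number of cores.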
Caffe doesn't process multiple images in parallel. The only saving you get by batch processing several images is in the time it takes to transfer the image data back and forth between Caffe's framework, which could be significant when dealing with the GPU.
IIRC there have been several attempts to make Caffe process images in parallel, but most focus on the GPU implementation (CUDNN, CUDA Streams etc.), with few attempts to add parallelism to the CPU code (OpenBLAS's multithread mode, or simply running on multiple threads). Of those, I believe only the CUDNN option is currently part of the stable version of Caffe, and it obviously requires a GPU. You can look at one of the pull requests about this matter on Caffe's GitHub page and see if it works for you, but note that it might cause compatibility issues with your current version.
This is one such version that in the past I've used, though it's no longer maintained: https://github.com/BVLC/caffe/pull/439
I've also noticed in the last comment of the above issue that there's some speed up to the CPU code on this pull request as well, though I've never tried it myself: https://github.com/BVLC/caffe/pull/2610
I am developing a cross-platform fractal explorer using Qt. I am experiencing a performance problem specifically when running on a single core CPU under Windows XP (program compiled with MSVC Express 2010), I haven't tried other versions of Windows. With two cores the program runs fine. It also runs fine under Linux with either one core or two cores (compiled with GCC).
The performance problem is something to do with calling a slot in the widget via the signal in the calculation thread. The widget contains a QImage and I pass a pointer to its pixels to the calculation thread. The thread calculates the fractal and plots the pixels to the image. At the end of each row, the thread emits a signal to the widget to tell it to update the display in the main thread. As I understand it, this is a queued connection.
With Windows and a single CPU the update is very slow, much slower than the calculation. It makes the program unusable.
The relevant code is similar to the Mandelbrot example in the Qt docs, except that my signal has no arguments, because the QImage is located in the widget rather than the thread, and I do not convert the QImage to a QPixmap.
Does anybody have any ideas of what the problem could be and how to go about solving it? Is it something to do with scheduling, time slicing allocation? Is there a compiler flag in MSVC that I need to set? Or do I need to modify my program some how?
Thanks very much!
You say the update is slower than the calculation - how much slower? Have you done any comprehensive profiling to see where exactly the bottleneck occurs? A cursory google finds this profiler which may help you.
Remember that on older CPUs, thread context switching is very expensive. This may be part of your problem, though again I don't know specifics.
I know this is probably general, please bear with me!
We've got a program that uses a web camera and, based on what the camera is seeing, runs certain functions. The program runs excellently on macOS and Linux, and it compiles and runs on Windows, but a couple of the functions (including one that iterates pixel by pixel over 640x480) drop the FPS to 1 or less, occasionally freezing it for a number of seconds.
Like I said, I know this is very general... I was just (desperately) hoping for anybody else's input on possible explanations. These same functions work fine on other platforms. I'm curious whether the camera is possibly running in its own thread, which gets bogged down? Maybe we just aren't looking in the right places to optimize? And is there possibly a resource on what to optimize when porting code to Windows?
Thanks so much, and any input is very much appreciated!
<<< EDIT >>>
As for the video source code, I'm using ewclib and
unsigned char * m_buffer;
EWC_Open(MEDIASUBTYPE_RGB24, 640, 480, FPS, true);
m_buffer = new unsigned char[EWC_GetBufferSize(0)];
EWC_GetImage(0, m_buffer);
What do you use to compile the program on Windows? Visual Studio? Cygwin? Are you sure you are not compiling a debug version? Have you turned on compiler optimization? You may also want to check your data types. You may be assuming long to be 64 bits, as it is on 64-bit Linux and macOS, while on Windows long is 32 bits.
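Type widths are easy to check directly instead of assuming them. A small sketch, assuming C++11; bitsOf is a hypothetical helper name.

```cpp
#include <climits>
#include <cstddef>
#include <cstdint>

// Width in bits of a type on the current platform.
template <typename T>
constexpr std::size_t bitsOf() { return sizeof(T) * CHAR_BIT; }

// Fixed-width types have the same size on every platform that provides them.
static_assert(bitsOf<std::int32_t>() == 32, "int32_t is 32 bits");
static_assert(bitsOf<std::int64_t>() == 64, "int64_t is 64 bits");
```

Printing bitsOf<int>() and bitsOf<long>() on each platform shows whether the code relies on a width that differs on Windows. Using std::int32_t or std::int64_t where the width matters avoids the question entirely.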
The hypothesis by rmeador that it's because Windows is slow is ridiculous: Aside from grabbing the picture, all actions are in userspace, no syscalls necessary. Therefore, I'd suggest removing all your recognition/function code and seeing whether the problem persists.
If this is the case, check your image grabbing mechanism. Maybe you are acquiring and releasing a handle to the camera every time you take a picture.
Otherwise, use a normal profiler to find the weak spots. If you suspect pixel manipulation might be at fault, ensure that you do that in userspace. I'm not familiar with Windows programming but I can imagine the problem could be that you are operating on a Windows resource for the manipulation/reading and calling for every pixel.
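Before reaching for a full profiler, a quick first check is to time one pass over a frame that lives in ordinary memory. The sketch below assumes C++11 and a 640x480 RGB24 frame held in a std::vector; sumPixels and timePixelPass are hypothetical names, and the summing loop stands in for the real per-pixel work.

```cpp
#include <chrono>
#include <vector>

// Sums every byte of a frame held in ordinary process memory.
unsigned long long sumPixels(const std::vector<unsigned char>& frame) {
    unsigned long long total = 0;
    for (unsigned char byte : frame) total += byte;
    return total;
}

// Returns how long one pass over the frame takes, in microseconds,
// and stores the computed sum in sumOut.
long long timePixelPass(const std::vector<unsigned char>& frame,
                        unsigned long long& sumOut) {
    auto start = std::chrono::steady_clock::now();
    sumOut = sumPixels(frame);
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
}
```

If a pass like this is fast in an optimized build but the real function is slow, the time is probably going into how the pixels are accessed or into frame grabbing, not into the loop itself.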
Do you call EWC_Open for every frame, or only once at the start? If the library is implemented in DirectShow and EWC_Open starts the graph, it will be quite slow.