I am trying to call feature detectors from OpenCV in my C++ application written in Visual Studio. I would like to run this operation in the background and do not care about its timing. In fact, I would prefer that it not interfere with the main performance of the application. To this end, I perform the feature detection in a separate thread and tried to lower the priority of that thread with SetThreadPriority(). This is not working, however: while the OpenCV function is running, all the CPU cores are maxed out. Is there any way to control the priority of tasks in OpenCV, or even limit the CPU cores involved in its processing?
Although I didn't find any way to reduce the priority of an OpenCV job, I could reduce the CPU usage by using setNumThreads(int numThreads).
I'm trying to boost the performance of my C++ facial tracking algorithm by optimizing the most time-consuming sections of my code. Interestingly, 86% of the entire tracking processing time is consumed by its feature extraction section. That feature extractor takes an image as input and returns the feature vector as output. I isolated the feature extractor from the tracker into a separate project to write an optimized version from scratch and made sure the input and output format and containers are the same as the ones used in the original tracker. Here's the breakdown of the resulting process time.
Original one 3 ms/frame
Optimized version 1 ms/frame
As can be seen it's supposed to be 3 times faster. But when I insert the new feature extractor into my tracker, inside the tracker it runs at about 2.5 ms/frame. I can't understand why it gets slower inside the tracker as opposed to when it runs as a standalone project?
But here's the catch that I discovered by accident. If I run the tracker and at the same time I run another project in the background, all of a sudden, the feature extractor inside the tracker would converge to 1 ms/frame. But as soon as I stop the background project it goes back to 2.5 ms/frame. This happens both on my laptop with Ubuntu 16.04 and on my Desktop PC with Windows 10.
In an attempt to understand this sort of behavior I used my Ubuntu System Monitor and noticed the following regarding all the 8 CPU cores.
When I run the feature extractor alone 7 cores are engaged by about 5% and one of the cores is engaged by 100%.
When I insert the feature extractor in the tracker and run the tracker, all the 8 cores are engaged by about 30%.
Now when I run the tracker and along with the tracker I run any other program (let's say the standalone feature extractor), out of all the cores that are engaged by 30%, one jumps up to 100% engagement which may explain the performance boost I get in the tracker as a result.
The evidence suggests that in order for the feature extractor to run at its maximum potential of 1 ms/frame, at least one of the cores should be engaged by 100%. My question is how can I make this happen?
Following on this thread I reimplemented my image processing code to send in 10 images at once (i.e. I now have the num property of the input blob set to 100 instead of 10).
However, the time required to process this batch is 10 times longer than before, which means that I did not get any performance increase.
Is that reasonable, or did I do something wrong?
I am running Caffe in CPU mode. Unfortunately GPU mode is not an option for me.
Update: Caffe now natively supports parallel processing of multiple images when using multiple GPUs. Though it seems relatively simple to implement based on the current implementation of GPU parallelism, at the moment there's no similar support for parallel processing on multiple CPUs.
The main problem with implementing parallelism is the syncing you need during training. If you just want to process your images in parallel (as opposed to training the model), you could load several copies of the same network into memory (whether through Python with multiprocessing or C++ with multi-threading) and process each image on a different network. It would be simple and quite effective, especially if you load the networks once and then just process a large number of images. Nevertheless, GPUs are much faster :)
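The "one network copy per thread" idea could be sketched roughly as follows in C++. This is only an illustration under stated assumptions: FakeNet, forward() and processInParallel() are hypothetical names, and FakeNet merely sums the pixel values so the sketch runs without Caffe. A real program would put its own loaded network in that place.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for a loaded network (NOT a Caffe class).
// forward() just sums the pixel values so the sketch is runnable.
struct FakeNet {
    float forward(const std::vector<float>& image) const {
        float sum = 0.0f;
        for (float v : image) sum += v;
        return sum;
    }
};

// Process all images with numWorkers threads, each using its own net copy.
std::vector<float> processInParallel(
        const std::vector<std::vector<float>>& images,
        std::size_t numWorkers) {
    assert(numWorkers > 0);
    std::vector<float> results(images.size());
    std::vector<FakeNet> nets(numWorkers);  // one copy per worker, created once
    std::vector<std::thread> workers;
    for (std::size_t w = 0; w < numWorkers; ++w) {
        workers.emplace_back([&, w] {
            // Worker w handles images w, w + numWorkers, ... so no two
            // threads ever write to the same slot of results.
            for (std::size_t i = w; i < images.size(); i += numWorkers)
                results[i] = nets[w].forward(images[i]);
        });
    }
    for (auto& t : workers) t.join();
    return results;
}
```

Because every worker owns its network copy and writes to disjoint result slots, no locking is needed. The cost is memory: each copy holds its own weights.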
Caffe doesn't process multiple images in parallel. The only saving you get from batching several images is in the time it takes to transfer the image data to and from Caffe's framework, which could be significant when dealing with the GPU.
IIRC there are several attempts to make Caffe process images in parallel, but most focus on the GPU implementation (cuDNN, CUDA streams etc.), with few attempts to add parallelism to the CPU code (OpenBLAS's multithreaded mode, or simply running on multiple threads). Of those, I believe only the cuDNN option is currently part of the stable version of Caffe, and it obviously requires a GPU. You can try one of the pull requests on this matter on Caffe's GitHub page and see if it works for you, but note that it might cause compatibility issues with your current version.
This is one such version that I've used in the past, though it's no longer maintained: https://github.com/BVLC/caffe/pull/439
I've also noticed in the last comment of the above issue that there's some speed up to the CPU code on this pull request as well, though I've never tried it myself: https://github.com/BVLC/caffe/pull/2610
I've been working on a face-tracking system for the last couple of months, and now I need to make everything run in parallel to increase performance.
The main cpp file is:
int _tmain(int argc, _TCHAR* argv[])
{
    // Initialise the buffer for camera frames: 20 frames, 1600x1200, 3 bytes per pixel.
    cFrame.initCamFrames(20, 1600, 1200, 3);
    // Initialise the eye-tracking software (OpenCV) with a pointer to the buffer it reads frames from.
    eyeTracking.initTrackingSystem(&cFrame);
    directShow directShowClass;
    // Initialise DirectShow with a pointer to the buffer where it should save camera frames.
    directShowClass.initiateDirectShow(false, &cFrame);
    directShowClass.runDirectShow();  // start capturing frames into the buffer
    eyeTracking.runTrackingSystem();  // start searching for face and eyes
    system("pause");
    directShowClass.stopDirectShow();
}
I want "directShowClass.runDirectShow();" and "eyeTracking.runTrackingSystem();" to run in real parallel. Right now I think they run as threads in pseudo-parallel (simple printf calls in each method appear mixed up in the terminal).
I guess that making a program run in parallel is not as simple as I would like it to be. But I guess that it is possible :D
Please give me some advice on where to start looking for information about how to parallelize.
I have a dual core processor.
Thanks!
Output from printf calls made on different threads can interleave, which is the mixing you encountered. You can run the process in pseudo-parallel (e.g. alternate between the processing steps on each call) or run it with hardware concurrency (std::thread, pthreads, Windows threads, boost::thread).
If you have a dual-core processor you can certainly take advantage of multi-core processing. I would suggest using Boost.
Just to be clear, by using threads you do get real parallelism. But remember that your computer is also running other background processes on its cores, and these take up CPU time, so your functions are not always executing.
In order to get some parallelism in C++ you have many options. I name three:
1. The oldest and most common way is to use the pthread library, which is available with almost every compiler.
2. The new C++ standard, C++11, includes native library support for multi-threading. You can check that out, but it is not yet supported by every compiler, and most compilers that support it have only partial functionality. You also need to enable the standard explicitly. For example:
   gcc -std=c++11
3. Finally, if you are in the mood for some "higher level" stuff, you can put some effort into learning the OpenMP framework, which uses pragma directives to annotate parallel tasks. The framework then deals with all the thread creation, so you can spend your time on other things.
P.S: The reason the output comes out mixed is not that the threads run in pseudo-parallel, but that they are writing to the same buffer concurrently. So when the buffer is dumped you see it in the order the threads wrote it. If anything, this is proof that they are actually running in parallel, but you are making them write their output to the same buffer ;)
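As a rough illustration of the std::thread option and of the shared-buffer point in the P.S., here is a minimal sketch. The names worker() and runBoth() are hypothetical, and the two workers only stand in for the capture and tracking loops. Both threads write to one stream, and a mutex makes sure each line is written whole, so lines from the two threads may alternate but never get mixed mid-line.

```cpp
#include <cassert>
#include <functional>
#include <mutex>
#include <ostream>
#include <sstream>
#include <string>
#include <thread>

// Hypothetical stand-in for a capture or tracking loop: writes one line
// per iteration to a stream shared with the other thread.
void worker(const std::string& name, int iterations,
            std::ostream& out, std::mutex& outMutex) {
    for (int i = 0; i < iterations; ++i) {
        // Hold the lock for the whole line so it cannot be split.
        std::lock_guard<std::mutex> lock(outMutex);
        out << name << " step " << i << '\n';
    }
}

// Runs both workers concurrently and returns everything they wrote.
std::string runBoth(int iterations) {
    assert(iterations >= 0);
    std::ostringstream buffer;
    std::ostream& sink = buffer;
    std::mutex outMutex;
    std::thread capture(worker, "capture", iterations,
                        std::ref(sink), std::ref(outMutex));
    std::thread tracking(worker, "tracking", iterations,
                         std::ref(sink), std::ref(outMutex));
    capture.join();   // wait for both threads before reading the buffer
    tracking.join();
    return buffer.str();
}
```

The order of "capture" and "tracking" lines in the result is not fixed and can change from run to run. That variation is the expected sign of the two threads making progress independently.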
I want to be able to get the current % CPU usage in a C++ program running under Windows CE.
I found this link that states where the source code is but I cannot find it in my platform builder installation - I expect this is because it isn't the Windows Automotive platform.
Does anyone know where I can find this source code or (even better) know how I can get this information directly? i.e. what DLL / function calls to make etc.
Since GetProcessTimes doesn't exist in CE, you have to calculate this.
You have to start with the toolhelp APIs to enumerate the processes and the threads in the processes. You then call GetThreadTimes for each thread and add all that up.
Bear in mind that the act of calculating this info will affect the CPU utilization.
I have found that GetIdleTime (or CeGetIdleTimeEx on WEC7 or newer) works well for calculating system-wide processor usage. Sample code for calculating the processor idle time percentage is shown on the GetIdleTime MSDN page. Processor utilization can then be calculated by subtracting the idle time percentage from 100.
The MSDN page does warn that support for GetIdleTime is dependent on OAL implementation.
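The arithmetic behind the idle-time approach can be sketched as below. This is a portable illustration, not the MSDN sample: IdleSample and cpuUsagePercent() are hypothetical names, and on the device the two fields would be filled from GetTickCount() and GetIdleTime(), both of which report milliseconds.

```cpp
#include <cassert>
#include <cstdint>

// One measurement. On the device, tickMs would come from GetTickCount()
// and idleMs from GetIdleTime() (assumed usage; both are in milliseconds).
struct IdleSample {
    std::uint32_t tickMs;
    std::uint32_t idleMs;
};

// System-wide CPU utilisation in percent between two samples.
double cpuUsagePercent(const IdleSample& start, const IdleSample& stop) {
    // Unsigned subtraction stays correct across one 32-bit wraparound.
    const std::uint32_t elapsed = stop.tickMs - start.tickMs;
    const std::uint32_t idle = stop.idleMs - start.idleMs;
    if (elapsed == 0) return 0.0;  // no time has passed, nothing to report
    double idlePercent = 100.0 * idle / elapsed;
    if (idlePercent > 100.0) idlePercent = 100.0;  // guard against timer skew
    return 100.0 - idlePercent;
}
```

For example, samples taken one second apart with 750 ms of additional idle time give 75% idle, i.e. 25% utilisation.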
Note that when using the toolhelp APIs to calculate the CPU usage, you need to take two measurements and then calculate the difference. When doing so, you won't know how much CPU time was used by any threads that terminated before the second sample.
So, applications that often create short-lived threads will not be represented properly in your result.
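The two-measurement calculation, including its blind spot for threads that exit between samples, might look like the sketch below. It assumes each snapshot has already been gathered on the device (toolhelp enumeration plus GetThreadTimes) into a map from thread ID to accumulated kernel plus user time in milliseconds. ThreadTimes and busyPercent() are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Thread ID -> accumulated kernel + user time in ms for one snapshot.
// (Assumed to be filled on the device via toolhelp + GetThreadTimes.)
using ThreadTimes = std::map<std::uint32_t, std::uint64_t>;

// Percentage of elapsedMs spent in the threads visible in both snapshots
// or created between them.
double busyPercent(const ThreadTimes& first, const ThreadTimes& second,
                   std::uint64_t elapsedMs) {
    if (elapsedMs == 0) return 0.0;
    std::uint64_t busy = 0;
    for (const auto& entry : second) {
        const auto it = first.find(entry.first);
        // A thread missing from the first snapshot was created in between,
        // so all of its accumulated time falls inside the interval.
        const std::uint64_t before = (it == first.end()) ? 0 : it->second;
        if (entry.second > before) busy += entry.second - before;
    }
    // Threads present only in `first` have exited. Whatever time they used
    // after the first snapshot is unknown and is NOT counted here.
    return 100.0 * busy / elapsedMs;
}
```

The sketch ignores thread ID reuse: if an ID is recycled between the two samples, the delta for that ID is meaningless.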
You can look into Remote Task Monitor. It lets you get the current % CPU usage of your process (or thread), which is exactly what you are looking for. It is also very lightweight and does not impact your device much.
Is it possible to measure CPU usage per thread on a Windows Mobile (or CE 5) device programmatically (C++)? If not, is there a utility that will monitor the CPU usage of a process?
CPU usage cannot be directly measured because, unlike an x86, the ARM processor doesn't have a register for it. You can calculate it using the Toolhelp APIs to get a list of processes and their child threads and then use GetThreadTimes to figure out how much time each thread is using.
Keep in mind that doing this calculation directly affects how much the CPU is in use.
Someone wrote a tool that looks a lot like Task Manager on the PC:
http://www.vttoth.com/LPK/taskmanager.html
As ctacke says, it does seem to use a lot of CPU: it reports using ~15%-30% of the CPU on our 800MHz ARM device.