I need to boost cv::dft performance in a multithreaded environment. I've done a simple test on Windows 10 on an Intel Core i5 processor:
Here I see that the CPU is not fully loaded (50% usage only). Individual threads are loaded equally and are also far from 100%. Why is that, and how can I fix it? Can DFT be easily parallelized? Is a parallel version implemented in the OpenCV library? Are there special build flags to enable it (and which ones)?
UPDATE: Running this code on Linux gives a slightly different result, but utilization is also below 100%:
First of all, the behavior of cv::dft depends on OpenCV build flags; for example, if you set WITH_IPP, it will use the Intel Performance Primitives (IPP) to speed up computation.

FFT is memory-bound: if you simply launch more threads, you most probably won't benefit significantly from the parallelism, because the threads will be waiting for each other to finish accessing memory. I've observed this both on Linux and Windows.

To gain more performance you should use FFTW3, which has a sophisticated algorithm for multi-threaded mode (it must be ./configure-d with a special flag, --enable-threads). I observed up to a 7x speedup with 8 threads. But FFTW's business-friendly license is paid-for only; otherwise it is GPL, which imposes the GPL on your software. I have not found any other open-source component that handles FFT parallelism in a smart way.
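For illustration, a minimal sketch of FFTW's multi-threaded mode (assumes FFTW was built with --enable-threads; the transform size and thread count are arbitrary):

#include <fftw3.h>

/* Build: gcc fft_mt.c -lfftw3_threads -lfftw3 -lpthread -lm */
int main(void)
{
    fftw_init_threads();          /* call once, before creating any plan */
    fftw_plan_with_nthreads(8);   /* plans created after this use 8 threads */

    const int n = 4096;
    fftw_complex *in  = fftw_alloc_complex((size_t)n * n);
    fftw_complex *out = fftw_alloc_complex((size_t)n * n);
    for (int i = 0; i < n * n; ++i) { in[i][0] = i % 7; in[i][1] = 0.0; }

    fftw_plan p = fftw_plan_dft_2d(n, n, in, out, FFTW_FORWARD, FFTW_ESTIMATE);
    fftw_execute(p);              /* the transform itself runs multi-threaded */

    fftw_destroy_plan(p);
    fftw_free(in); fftw_free(out);
    fftw_cleanup_threads();
    return 0;
}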
I'm working on a project (hardware: Raspberry Pi 3B+) that involves a lot of computation and parallel processing. At present I'm noticing some lag in the code's performance, so I'm constantly looking for efficient ways to improve my code and its performance.
Currently I'm using the C language (because I can access and manipulate lower-level drivers easily) and developing my own set of functions, libraries and drivers, which run faster than any pre-defined or ready-made libraries or plugins.
Now, instead of software-based multithreading (pthreads), I want to use separate cores to perform the corresponding tasks. So, any suggestions or guidelines on how I can use the different cores of the Raspberry Pi?
Moreover, how can I check CPU utilization so I can choose the best core to perform a certain task?
Thanking with regards,
Aatif Shaikh
At the C/C++ level you do not have control over which CPU core will run which thread. Just use the C++11 standard threads and let the OS scheduler decide which thread runs where.
That said, Linux has the taskset tool to check and set thread affinity, and there's also the sched_setaffinity() function.
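For completeness, a minimal Linux-only sketch of pinning a std::thread to a specific core via pthread_setaffinity_np (the core number is arbitrary, and as said above, usually you shouldn't do this):

#include <pthread.h>
#include <sched.h>
#include <cstdio>
#include <thread>

// Build: g++ -pthread affinity.cpp
int main()
{
    std::thread worker([] {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(2, &set);   // request core 2 for this thread
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
        std::printf("worker now on core %d\n", sched_getcpu());
    });
    worker.join();
    return 0;
}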
I have an Android app with a C++ library which uses pthreads to break down rendering tasks. This is for devices running Android 4+.
Let's say I have a 100 x 100 array of elements on which I repeatedly do CPU-intensive processing. Currently I'm breaking the array up into four 25 x 100 element chunks and handing them off to four POSIX threads (from a pool of stalled, pre-created threads). This gives an almost 4x speed increase on iOS and desktop Mac, but slower results than single-threading on Android.
So the same code successfully speeds up the app on iOS and desktop Mac, but on Android it often makes it even slower.
I have done some tests, and only quite big chunks of data get faster with multithreading. If the whole process (all threads) takes around 2 seconds or more, multithreaded mode is faster; if it takes less (say, about 400 ms), it is either the same speed or slower than just calling the rendering function normally. This could point to thread switching being really slow. The bigger the processing tasks, the more they profit from multithreading. My tasks are usually not that big, but they are not fast enough in single-threaded mode.
I have also noticed that on ARM builds the gap between the slower multithreaded version and the faster single-threaded version is quite significant (single threading is almost twice as fast as multithreading), whereas on x86 builds both the multithreaded and single-threaded versions run at about the same speed as single threading does on ARM builds. So x86 builds do not get slower with multithreading, but they do not get faster either.
Has anyone else seen the same behaviour, or does anyone know where the slowdown could come from? Are there any special requirements for multithreading on Android? Unfortunately I can't really post any code at the moment, but it is all standard POSIX threading code that works fine on iOS and Mac in general and has been in use for years.
Android vendors aggressively optimize for battery life, which includes keeping the number of online (hot-plugged) cores and, where possible, their individual frequencies low.
The generic idea for managing the number of online cores is to keep an eye on system load over a period of time (a window). If the load persists above a threshold, the system brings additional available cores online. As far as I know, this decision is always taken by a user-level daemon. This approach is generally very different from desktops, since the ability to bring cores online/offline and benefit from it is mostly SoC-dependent.
Managing CPU frequency is similar: if the load persists, the CPU frequency is increased. But there is a more settled mechanism for this, provided by Linux and called cpufreq, so frequency scaling works similarly on desktop and mobile.
So it is very possible that you are creating a CPU load pattern that does not trigger core bring-up or a frequency increase, which would match what you describe: short bursts of work that finish before the governor reacts.
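You can check this on the device (via adb shell on Android, or natively on any Linux box) by reading the same sysfs files the governors use. A minimal sketch, using the standard Linux cpufreq/hotplug paths:

#include <fstream>
#include <iostream>
#include <string>

static std::string read_line(const std::string &path)
{
    std::ifstream f(path);
    std::string line;
    std::getline(f, line);
    return line;
}

int main()
{
    const std::string base = "/sys/devices/system/cpu/cpu0/cpufreq/";
    std::cout << "governor:  " << read_line(base + "scaling_governor") << "\n";
    std::cout << "cur. freq: " << read_line(base + "scaling_cur_freq") << " kHz\n";
    // 1 = core online, 0 = hot-unplugged (cpu0 usually has no 'online' file)
    std::cout << "cpu1 online: "
              << read_line("/sys/devices/system/cpu/cpu1/online") << "\n";
    return 0;
}

Run it repeatedly while your threaded workload executes; if the frequency never rises and extra cores never come online, the load pattern is the likely culprit.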
What methods exist to verify that work is indeed being parallelized by OpenCL? (How can I verify that work is being distributed to all the processing elements for execution?) Or at least a method to monitor which cores/processors of the GPU or CPU are being used?
I would simply like a way to verify that OpenCL is actually doing what its specification claims it is supposed to do. To do this, I need to collect hard evidence that OpenCL / the OS / the drivers are indeed scheduling kernels and work items to execute in parallel (as opposed to serially).
I have written an OpenCL program conforming to the OpenCL 1.2 API specification, along with a simple OpenCL C kernel which simply squares the input integer.
In my program, work_group_size = MAX_WORK_GROUP_SIZE (so that the work groups fit on the compute units and OpenCL won't throw a fit).
The total amount_of_work is a scalar multiple of (MAX_COMPUTE_UNITS * MAX_WORK_GROUP_SIZE). Since amount_of_work > MAX_COMPUTE_UNITS * MAX_WORK_GROUP_SIZE, this should hopefully be enough to force the scheduler to execute the maximum number of kernels + work items as efficiently as possible, making use of all the available cores / processors.
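A sketch of the sizing logic described above (the helper name and the multiplier k are illustrative, not part of my actual program):

#include <CL/cl.h>
#include <stdio.h>

void choose_work_sizes(cl_device_id dev, size_t *global, size_t *local)
{
    cl_uint max_cu;  /* number of compute units */
    size_t  max_wg;  /* max work-group size */
    clGetDeviceInfo(dev, CL_DEVICE_MAX_COMPUTE_UNITS,
                    sizeof(max_cu), &max_cu, NULL);
    clGetDeviceInfo(dev, CL_DEVICE_MAX_WORK_GROUP_SIZE,
                    sizeof(max_wg), &max_wg, NULL);

    const size_t k = 4;            /* scalar multiple (illustrative) */
    *local  = max_wg;              /* work_group_size = MAX_WORK_GROUP_SIZE */
    *global = k * max_cu * max_wg; /* amount_of_work */

    printf("compute units: %u, local: %zu, global: %zu\n",
           (unsigned)max_cu, *local, *global);
    /* later: clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                                     global, local, 0, NULL, NULL); */
}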
For a CPU, you can check cpuid, or call sched_getcpu (Linux) or GetCurrentProcessorNumber (Windows), to find out which core / processor the current thread is currently executing on.
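A minimal sketch of that CPU-side check (the helper name is hypothetical):

#ifdef _WIN32
#include <windows.h>
static unsigned current_core(void) { return GetCurrentProcessorNumber(); }
#else
#include <sched.h>    /* sched_getcpu needs _GNU_SOURCE on glibc */
static unsigned current_core(void) { return (unsigned)sched_getcpu(); }
#endif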
Is there a method on the OpenCL API which provides this information? (I have yet to find any.)
Is there an OpenCL C built-in function... or perhaps do the vendors' compilers understand some form of assembly language which I could use to obtain this information?
Is there an equivalent of cpuid, sched_getcpu, or GetCurrentProcessorNumber for GPUs, for core-usage monitoring and the like? Perhaps something vendor- or architecture-specific?
Is there an external program which I could use as a monitor for this information? I have tried Process Monitor and AMD's CodeXL, neither of which is useful for what I'm looking for. Intel has VTune, but I doubt that works on an AMD GPU.
Perhaps I could take a look at the compiled kernel code generated by the AMD and Intel compilers for some hints?
Hardware Details:
GPU: AMD FirePro, using the AMD Cape Verde architecture, 7700M series chipset. I don't know exactly which one in the series it is. If there is an AMD instruction set manual for this architecture (the way there are manuals for x86), that would possibly be a start.
CPU: Intel(R) Core(TM) i7-3630QM CPU @ 2.40GHz
Development Environment Details:
OS: Windows 7 64-bit; it will also eventually need to run on Linux, but that's beside the point.
Compiling with MinGW GNU GCC 4.8.1 -std=c++11
Intel OpenCL SDK (OpenCL header, libraries, and runtime)
According to Process Manager, Intel's OpenCL compiler is a clang variant.
AMD APP OpenCL SDK (OpenCL header, libraries, and runtime)
OpenCL 1.2
I am trying to keep the source code as portable as possible.
Instead of relying on speculation, you can comment out a program's buffer copies and visualisations, leaving only the kernel executions intact. Then put it in a tight loop and watch the heat rise. If it heats up like FurMark, then it is using the cores. If it is not heating up, you can also disable the serial operations in the kernels (the parts gated on gid == 0), then try again. For example, a simple n-body simulator pushes a well-cooled HD 7000 series GPU to over 70°C in minutes, and to 90°C with a poor cooler. Compare this against a known benchmark's temperature limits.
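A sketch of that tight loop (queue, kernel and the NDRange sizes come from your normal setup; the iteration count is arbitrary):

#include <CL/cl.h>

void heat_test(cl_command_queue queue, cl_kernel kernel,
               size_t global, size_t local)
{
    /* Enqueue only the kernel: no buffer copies, no readbacks. */
    for (int i = 0; i < 100000; ++i)
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                               &global, &local, 0, NULL, NULL);
    clFinish(queue);   /* block until all enqueued work has drained */
}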
A similar approach exists for CPUs: using float4 heats the chip more than plain floats, which shows that even the instruction type matters for exercising all the ALUs (let alone all the threads).
If the GPU has a really good cooler, you can watch its Vdroop instead: more load means more voltage drop; more cores in use means more drop, and more load per core also means more drop.
Whatever you do, it's up to the compiler's and the hardware's abilities, and you don't have explicit control over the ALUs, because OpenCL hides the hardware's complexity from the developer.
Using MSI Afterburner or similar software is not useful, because such tools show 100% usage even when you use 1% of the card's true potential.
Alternatively, simply look at the temperature difference of the computer case between its equilibrium state and its starting state. If the delta-T is around 50 degrees with OpenCL and 5 without, OpenCL is parallelising something, though you can't know how much.
I'm developing a C/C++ application to manipulate large quantities of data in a generic way (aggregation/selection/transformation).
I'm using an AMD Phenom II X4 965 Black Edition, so it has a decent amount of different caches.
I've developed both ST (single-threaded) and MT (multi-threaded) versions of the functions that perform all the individual operations, and, not surprisingly, in the best case the MT versions are 2x faster than the ST ones, even when using 4 cores.
Given that I'm a fan of using 100% of the available resources, I was annoyed that it was just 2x; I'd want 4x.
For this reason I've already spent a considerable amount of time with -pg and valgrind, using the cache simulator (cachegrind) and the call-graph profiler (callgrind). The program works as expected, the cores share the input data to process (i.e. the operations to apply to the data), and cache misses are reported (as expected, sic) when the different threads load the data to be processed (millions of entities, or rows, if that gives you an idea of what I'm trying to do :-)).
Eventually I tried different compilers, g++ and clang++, both with -O3, and performance is identical.
My conclusion is that, due to the large amount of data to process (GBs of it), and given that the data eventually has to be loaded into the CPU, the remaining time is genuine memory wait time.
Can I further improve my software? Have I hit a limit?
I'm using C/C++ on Linux x86-64, Ubuntu 11.10.
I'm all ears! :-)
What kind of application is it? Could you show us some code?
As I commented, you might have reached some hardware limit like RAM bandwidth. If you did, no software trick could improve it.
You might investigate using MPI, OpenMP, or OpenCL (on GPUs) but without an idea of your application we cannot help.
If you're compiling with GCC and want to help the processor's cache prefetching, consider using __builtin_prefetch, with care and parsimony (using it too much or badly can decrease performance).
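A sketch of careful prefetching (the prefetch distance of 16 iterations is illustrative and must be tuned; a bad distance can make things slower):

// GCC/Clang builtin: __builtin_prefetch(addr, rw, locality)
void scale(double *a, long n, double k)
{
    for (long i = 0; i < n; ++i) {
        if (i + 16 < n)
            /* hint: a[i+16] will be written soon, low temporal locality */
            __builtin_prefetch(&a[i + 16], 1, 0);
        a[i] *= k;
    }
}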
I am building an application that will do some object tracking from a video camera feed and use that information to run a particle system in OpenGL. The code to process the video feed is somewhat slow, 200-300 milliseconds per frame right now. The system this will be running on has a dual-core processor. To maximize performance I want to offload the camera processing to one core and just communicate relevant data back to the main application as it becomes available, leaving the main application running on the other core.
What do I need to do to offload the camera work to the other processor and how do I handle communication with the main application?
Edit:
I am running Windows 7 64-bit.
Basically, you need to multithread your application. Each thread of execution can only saturate one core. Separate threads tend to be run on separate cores. If you insist that each thread ALWAYS execute on a specific core, every operating system has its own way of specifying this (affinity masks and such)... but I wouldn't recommend it.
OpenMP is great, but it's a tad heavyweight, especially when joining back up after a parallel section. YMMV. It's easy to use, but not at all the best-performing option. It also requires compiler support.
If you're on Mac OS X 10.6 (Snow Leopard), you can use Grand Central Dispatch. It's interesting to read about, even if you don't use it, as its design implements some best practices. It also isn't optimal, but it's better than OpenMP, even though it also requires compiler support.
If you can wrap your head around breaking up your application into "tasks" or "jobs," you can shove these jobs down as many pipes as you have cores. Think of batching your processing as atomic units of work. If you can segment it properly, you can run your camera processing on both cores, and your main thread at the same time.
If communication is minimized for each unit of work, your need for mutexes and other locking primitives will be minimized too. Coarse-grained threading is much easier than fine-grained. And you can always use a library or framework to ease the burden. Consider Boost's Thread library if you take the manual approach; it provides portable wrappers and a nice abstraction.
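If you do take the manual approach, the pattern could look roughly like this (shown with std::thread and a condition variable; Boost.Thread is nearly identical, and the type and function names here are illustrative):

#include <atomic>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>

struct TrackingResult { /* whatever the camera stage produces */ };

std::mutex                 m;
std::condition_variable    cv;
std::queue<TrackingResult> results;
std::atomic<bool>          done{false};

// Camera (worker) thread: produce results and notify the consumer.
void camera_worker()
{
    while (!done) {
        TrackingResult r{};   // ... process one frame here ...
        {
            std::lock_guard<std::mutex> lock(m);
            results.push(r);
        }
        cv.notify_one();
    }
}

// Main/render thread: block until a result is available, then take it.
TrackingResult wait_for_result()
{
    std::unique_lock<std::mutex> lock(m);
    cv.wait(lock, [] { return !results.empty(); });
    TrackingResult r = results.front();
    results.pop();
    return r;
}

If the render loop must never block, replace wait_for_result with a non-blocking drain that grabs whatever is queued each frame.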
It depends on how many cores you have. If you have only 2 cores (CPUs, processors, hyperthreads, you know what I mean), then OpenMP cannot give a tremendous increase in performance, but it will help. The maximum gain you can get is your current time divided by the number of processors, so it would still take 100-150 ms per frame.
The equation is:

parallel_time = (total_time - serial_time) / number_of_cpus + serial_time

where serial_time is the time spent in code that cannot be parallelized.
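For example (illustrative numbers): with a 300 ms frame, 2 cores, and 60 ms of inherently serial code, you get (300 - 60) / 2 + 60 = 180 ms, not 150 ms; the serial part caps the achievable speedup (this is Amdahl's law).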
Basically, OpenMP rocks at parallel loop processing. It's rather easy to use:
#pragma omp parallel for
for (int i = 0; i < N; i++)   /* loop variable is private to each thread */
    a[i] = 2 * i;
and bang, your for loop is parallelized. It does not work for every case; not every algorithm can be parallelized this way, but many can be rewritten (hacked) to be compatible. The key principle is Single Instruction, Multiple Data (SIMD): applying the same convolution code to multiple pixels, for example.
But simply applying this cookbook recipe goes against the rules of optimization:
1. Benchmark your code.
2. Find the REAL bottlenecks with "scientific" evidence (numbers) instead of simply guessing where you think they are.
3. If the bottlenecks really are processing loops, then OpenMP is for you.
Maybe simple optimizations on your existing code can give better results, who knows?
Another road would be to run OpenGL in one thread and the data processing in another. This will help a lot if OpenGL or your particle rendering system takes a lot of power, but remember that threading can lead to other kinds of synchronization bottlenecks.
I would recommend against OpenMP; OpenMP is more for numerical codes than for the producer/consumer model that you seem to have.
I think you can do something simple using Boost threads: spawn a worker thread, share a common segment of memory (for communicating the acquired data), and add some notification mechanism to signal when your data is available (look into Boost thread interruptions).
I do not know what kind of processing you do, but you may want to take a look at Intel Threading Building Blocks and the Intel Integrated Performance Primitives; they have several functions for video processing which may be faster (assuming they cover your functionality).
You need some kind of framework for handling multicores. OpenMP seems a fairly simple choice.
Like Pestilence said, you just need your app to be multithreaded. Lots of frameworks like OpenMP have been mentioned, so here's another one:
Intel Threading Building Blocks
I've never used it before, but I hear great things about it.
Hope this helps!