C++: Low CPU usage on Ubuntu multi-core server

I'm having a problem running C++ code on a powerful multi-core server that runs Ubuntu. The problem is that my app uses less than 10% of one CPU, but the same app uses around 100% of one CPU on my i3 notebook, which runs a different version of Ubuntu.
My OS:
Linux version 3.11.0-23-generic (buildd@batsu) (gcc version 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu5) ) #40~precise1-Ubuntu SMP Wed Jun 4 22:06:36 UTC 2014
The server's OS:
Linux version 3.11.0-12-generic (buildd@allspice) (gcc version 4.8.1 (Ubuntu/Linaro 4.8.1-10ubuntu7) ) #19-Ubuntu SMP Wed Oct 9 16:20:46 UTC 2013
At least for now, I do not need to parallelize the code, nor make it more efficient. I just want to know how I can achieve 100% use of a core on this server.
Could anyone help me?

It may not be your OS but instead the compiler. Compilers are moving targets: year by year they (hopefully) improve their automatic optimizations. Your code may be getting vectorized without your knowing it. Yes, I realize that you are using a newer compiler on your laptop.
See if you still have the performance delta when you disable all optimizations (-O0 or some such). If you are trying to maximize CPU cycles, you may be using numerical calculations that are easily vectorized; the same goes for parallelization. You can also get general optimization reports, as well as a specific vectorization report, from gcc. I don't recall the exact parameter, but you can find it easily online.
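To check this, a minimal experiment along these lines can help. The loop below is a typical auto-vectorization candidate, and the report flags in the comments are the usual GCC ones; the exact spelling varies by GCC version, so check your compiler's documentation:

```cpp
// vec_test.cpp -- a loop GCC can typically auto-vectorize at -O3.
// Build and compare, for example:
//   g++ -O0 vec_test.cpp -o novec                               (all optimizations off)
//   g++ -O3 -ftree-vectorizer-verbose=2 vec_test.cpp -o vec     (GCC 4.x vectorization report)
//   g++ -O3 -fopt-info-vec vec_test.cpp -o vec                  (newer GCC report flag)
#include <vector>
#include <cstdio>

int main() {
    const std::size_t n = 1 << 24;
    std::vector<float> a(n, 1.0f), b(n, 2.0f), c(n);

    // Simple element-wise loop: a prime candidate for SSE/AVX vectorization.
    for (std::size_t i = 0; i < n; ++i)
        c[i] = a[i] * b[i] + 0.5f;

    std::printf("%f\n", c[n / 2]);  // keep the result live so the loop isn't optimized away
    return 0;
}
```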
Also, there is a world of difference between the number of cores on a server (probably a multi-core Xeon) and on your i3. Your i3 has 2 cores, each capable of running two hardware threads, meaning you have in effect 4 CPUs. Depending on your server configuration, you can have up to 18 cores with two hardware threads each per processor, which translates to 36 effective CPUs, and you can have multiple processors per motherboard. You can do the math.
Both the compiler and the OS can impact an application's processor use. If you are forking off multiple threads to try to consume processing, the OS can farm those out to different processors, reducing your system-wide CPU usage. Even if you are running pure serial code, a smart compiler can break the code up into multiple threads that your threading library may distribute over those 36 effective CPUs.
Your OS can also meter how much processing you can hog. If this server is not under your control, the administrator may have established a policy that limits the percentage of processing that any one application can consume.
So in conclusion:
(1) Disable all optimizations
(2) Check on individual core CPU usage to see what the load is on all your effective CPUs
(3) Restructure your code to farm out tasks that will be distributed across your effective CPUs, each consuming as much processing as possible (a minimal sketch of this follows the list).
(4) Make sure your admin isn't limiting the amount of processing individual applications can consume.
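For point (3), here is a minimal sketch of farming CPU-bound work out over all effective CPUs with plain std::thread; the busy loop is just illustrative stand-in work, not your algorithm:

```cpp
// burn_cores.cpp -- spawn one busy worker per hardware thread.
// Build: g++ -std=c++11 -O2 -pthread burn_cores.cpp
#include <thread>
#include <vector>
#include <atomic>
#include <cstdio>

int main() {
    unsigned n = std::thread::hardware_concurrency();   // "effective CPUs" (cores x HW threads)
    if (n == 0) n = 1;                                   // the call may return 0 if unknown
    std::printf("spawning %u workers\n", n);

    std::atomic<unsigned long long> total(0);
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < n; ++t) {
        workers.emplace_back([&total] {
            unsigned long long local = 0;
            for (unsigned long long i = 0; i < 2000000000ULL; ++i)
                local += i % 7;                          // stand-in for real, independent work
            total += local;                              // publish so the loop isn't optimized out
        });
    }
    for (std::thread &w : workers) w.join();
    std::printf("done: %llu\n", (unsigned long long)total.load());
    return 0;
}
```

Running it while watching per-core usage (for example, top and then pressing 1 to show individual CPUs) also covers point (2).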

Related

Is cv::dft transform multithreaded?

I need to boost cv::dft performance in a multithreaded environment. I've done a simple test on Windows 10 on an Intel Core i5 processor.
I see that the CPU is not fully loaded (only 50% usage). The individual threads are loaded equally, but also far from 100%. Why is that, and how can I fix it? Can DFT be easily parallelized? Is it implemented in the OpenCV library? Are there special build flags to enable it (and which)?
UPDATE: Running this code on Linux gives a slightly different result, but also below 100% utilization.
First of all, the behavior of cv::dft depends on the OpenCV build flags; for example, if you set WITH_IPP, it will use the Intel Performance Primitives to speed up the computation. FFT is memory-bound, so if you simply launch more threads you most probably won't benefit significantly from the parallelism, because the threads will be waiting for each other to finish accessing memory; I've observed this on both Linux and Windows. To gain more performance you should use FFTW3, which has a sophisticated algorithm for its multi-threaded mode (it must be ./configure-d with the threading flag at build time). I observed up to a 7x speedup with 8 threads. But the only business-friendly FFTW license is a paid one; the free version imposes the GPL on your software. I have not found any other open-source components that handle FFT parallelism in a smart way.
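As a hedged illustration of the FFTW3 route (this assumes an FFTW3 build with threading enabled and linking against fftw3 and fftw3_threads; the transform size and thread count are arbitrary):

```cpp
// fftw_threads.cpp -- sketch of a multi-threaded FFTW3 plan.
// Build (assuming a threaded FFTW build): g++ fftw_threads.cpp -lfftw3_threads -lfftw3 -lpthread
#include <fftw3.h>
#include <cstdio>

int main() {
    const int n = 1 << 20;

    fftw_init_threads();           // must be called before any other FFTW routine
    fftw_plan_with_nthreads(8);    // all plans created after this use 8 threads

    fftw_complex *in  = fftw_alloc_complex(n);
    fftw_complex *out = fftw_alloc_complex(n);
    for (int i = 0; i < n; ++i) { in[i][0] = i % 13; in[i][1] = 0.0; }

    fftw_plan plan = fftw_plan_dft_1d(n, in, out, FFTW_FORWARD, FFTW_ESTIMATE);
    fftw_execute(plan);
    std::printf("out[0] = %f\n", out[0][0]);

    fftw_destroy_plan(plan);
    fftw_free(in);
    fftw_free(out);
    fftw_cleanup_threads();
    return 0;
}
```

On the OpenCV side, printing cv::getBuildInformation() shows whether IPP and a parallel framework were compiled into your build, which is the quickest way to see what cv::dft can actually use.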

Very interesting result from an IPC latency test on Linux 2.6.18

I am doing a performance (latency) test on a Unix socket under Linux 2.6.18. Process A writes 1024 bytes to process B every 10 ms, and the result shows an average latency of 20 us with a small standard deviation (2~3 us).
The test becomes interesting when I run some additional CPU-bound processes simultaneously with processes A and B. These new processes are very cache-friendly, such as a busy loop of simple math calculations, but to my surprise the IPC latency suddenly goes down, to 15 us on average.
As far as I know, to improve interactivity the O(1) scheduler (used prior to 2.6.23) rewards I/O-bound processes through some heuristic, but that can't explain why the latency becomes even lower than in the first case.
I have also considered whether Linux special-cases busy loops in a way that rewards process A, but further testing suggests it does not.
This really confuses me.
my configuration:
CPU: Intel(R) Xeon(R) CPU E5-2609 0 @ 2.40GHz with 10M L3 cache
MEM: 32G
OS: Linux 2.6.18-308.el5 SMP x86_64
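For reference, a minimal sketch of this kind of test (a socketpair plus CLOCK_MONOTONIC timestamps, measuring the round trip rather than one-way latency; it is not the asker's actual harness):

```cpp
// ipc_latency.cpp -- the parent sends 1024 bytes every 10 ms, the child echoes
// them back, and the parent times the round trip.
// Build: g++ -O2 ipc_latency.cpp -o ipc_latency   (add -lrt on very old glibc)
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>
#include <time.h>
#include <cstdio>
#include <cstring>

static double now_us() {
    timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e6 + ts.tv_nsec / 1e3;
}

int main() {
    int sv[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0) { std::perror("socketpair"); return 1; }

    char buf[1024];
    std::memset(buf, 'x', sizeof(buf));

    if (fork() == 0) {                       // child (process B): echo everything back
        close(sv[0]);
        for (;;) {
            ssize_t n = read(sv[1], buf, sizeof(buf));
            if (n <= 0) return 0;            // parent closed its end: exit
            write(sv[1], buf, n);
        }
    }
    close(sv[1]);                            // parent (process A): timed ping every 10 ms

    for (int i = 0; i < 1000; ++i) {
        usleep(10000);
        double t0 = now_us();
        write(sv[0], buf, sizeof(buf));
        read(sv[0], buf, sizeof(buf));
        std::printf("round trip: %.1f us\n", now_us() - t0);
    }
    return 0;
}
```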
I suspect that some power-saving feature of the hardware is at work here. A 10 ms sleep is more than enough time for modern hardware to enter a low-power state. When you're looking at things at the microsecond level, there is a real, measurable latency to coming out of a power-saving state.
My guess is that running the "busy" program in parallel prevents the hardware from entering a low-power state. Standard things to try:
At the BIOS level, disable any and all power-saving features including C-states
At the OS level, disable cpuspeed (or whatever frequency scaling program your distro uses)
Try booting with the "idle=poll" kernel parameter
That last suggestion is especially important for Sandy Bridge CPUs (which is what you have), at least with RHEL/CentOS 5.x (which I'm guessing you're running). I found the Linux kernel would still override some BIOS settings. Other Linux kernel params that may help you:
intel_idle.max_cstate=0
processor.max_cstate=0
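A software-only way to test this hypothesis, without touching the BIOS or kernel parameters, is to run a throwaway spinner thread alongside the measurement, which is essentially what the extra CPU-bound processes were doing. A minimal sketch (it deliberately burns a core, so it is a diagnostic, not a fix):

```cpp
// keep_awake.cpp -- spin on one core to discourage deep idle states while
// something latency-sensitive is being measured elsewhere.
// Build: g++ -std=c++11 -O2 -pthread keep_awake.cpp
#include <atomic>
#include <chrono>
#include <thread>
#include <cstdio>

int main() {
    std::atomic<bool> stop(false);

    std::thread spinner([&stop] {
        volatile unsigned long long x = 0;
        while (!stop.load(std::memory_order_relaxed))
            ++x;                            // pure busy work, keeps that CPU at full speed
    });

    // ... run the latency test here while the spinner keeps the hardware awake ...
    std::this_thread::sleep_for(std::chrono::seconds(10));

    stop.store(true);
    spinner.join();
    std::printf("done\n");
    return 0;
}
```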

OpenCL Verification of Parallel Execution

What methods exist to verify that work is indeed being parallelized by OpenCL? (How can I verify that work is being distributed to all the processing elements for execution?) Or, at the least, is there a method to monitor which cores/processors of the GPU or CPU are being used?
I would simply like a way to verify that OpenCL is actually doing what its specification claims it is supposedly doing. To do this, I need to collect hard evidence that OpenCL / the OS / the drivers are indeed scheduling kernels and work items to be executed in parallel (as opposed to serially).
I have written an OpenCL program conforming to the OpenCL API 1.2 specification, along with a simple OpenCL C kernel which simply squares the input integer.
In my program, work_group_size = MAX_WORK_GROUP_SIZE (so that the work groups fit on the compute units and OpenCL won't throw a fit).
The total amount_of_work is a scalar multiple of (MAX_COMPUTE_UNITS * MAX_WORK_GROUP_SIZE). Since amount_of_work > MAX_COMPUTE_UNITS * MAX_WORK_GROUP_SIZE, this should hopefully be enough to force the scheduler to execute the maximum number of kernels and work items as efficiently as possible, making use of all the available cores/processors.
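For context, a minimal sketch of that host-side sizing, querying CL_DEVICE_MAX_COMPUTE_UNITS and CL_DEVICE_MAX_WORK_GROUP_SIZE and deriving the global size from them (error handling omitted, and the multiplier is an arbitrary placeholder; this is not the asker's actual program):

```cpp
// cl_sizing.cpp -- query device limits and size an NDRange from them.
// Build (AMD APP / Intel SDK): g++ cl_sizing.cpp -lOpenCL
#include <CL/cl.h>
#include <cstdio>

int main() {
    cl_platform_id platform;
    cl_device_id device;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

    cl_uint max_compute_units = 0;
    size_t max_work_group_size = 0;
    clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS,
                    sizeof(max_compute_units), &max_compute_units, NULL);
    clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE,
                    sizeof(max_work_group_size), &max_work_group_size, NULL);

    // work_group_size = MAX_WORK_GROUP_SIZE, and amount_of_work a multiple of
    // MAX_COMPUTE_UNITS * MAX_WORK_GROUP_SIZE, as described in the question.
    const size_t multiple = 16;   // placeholder scalar multiple
    size_t local_size  = max_work_group_size;
    size_t global_size = multiple * max_compute_units * max_work_group_size;

    std::printf("compute units: %u, max work-group size: %lu\n",
                max_compute_units, (unsigned long)max_work_group_size);
    std::printf("local: %lu, global: %lu work items\n",
                (unsigned long)local_size, (unsigned long)global_size);

    // ...create the context/queue/kernel and pass global_size/local_size to
    //    clEnqueueNDRangeKernel as usual...
    return 0;
}
```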
For a CPU, you can check cpuid, sched_getcpu, or GetCurrentProcessorNumber to see which core/processor the current thread is currently executing on.
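On the CPU side that check is straightforward from plain host threads; here is a minimal sketch using sched_getcpu (Linux-specific; the analogous Windows call is GetCurrentProcessorNumber):

```cpp
// which_cpu.cpp -- each worker reports which core it is currently running on.
// Build: g++ -std=c++11 -O2 -pthread which_cpu.cpp
#include <sched.h>     // sched_getcpu (Linux; g++ defines _GNU_SOURCE by default)
#include <thread>
#include <vector>
#include <cstdio>

int main() {
    unsigned n = std::thread::hardware_concurrency();
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < n; ++t) {
        workers.emplace_back([t] {
            // Do a bit of work so the scheduler actually spreads the threads out.
            volatile unsigned long long x = 0;
            for (unsigned long long i = 0; i < 100000000ULL; ++i) ++x;
            std::printf("thread %u ran on cpu %d\n", t, sched_getcpu());
        });
    }
    for (std::thread &w : workers) w.join();
    return 0;
}
```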
Is there a method on the OpenCL API which provides this information? (I have yet to find any.)
Is there an OpenCL C built-in function, or perhaps some form of assembly language understood by the vendors' compilers, that I could use to obtain this information?
Is there an equivalent of cpuid, sched_getcpu, or GetCurrentProcessorNumber for GPUs, for core usage monitoring and the like? Perhaps something vendor- or architecture-specific?
Is there an external program which I could use as a monitor for this information? I have tried Process Monitor and AMD's CodeXL, both of which are not useful for what I'm looking for. Intel has VTune, but I doubt that works on an AMD GPU.
Perhaps I could take a look at the compiled kernel code as generated from the AMD and Intel Compilers for some hints?
Hardware Details:
GPU: AMD FirePro, using the AMD Cape Verde architecture, 7700M Series chipset. I don't know exactly which one in the series it is. If there is an AMD instruction set manual for this architecture (as there are for x86), that would be a possible start.
CPU: Intel(R) Core(TM) i7-3630QM CPU @ 2.40GHz
Development Environment Details:
OS: Win 7 64-bit; it will also eventually need to run on Linux, but that's beside the point.
Compiling with MinGW GNU GCC 4.8.1 -std=c++11
Intel OpenCL SDK (OpenCL header, libraries, and runtime)
According to Process Manager, Intel's OpenCL compiler is a clang variant.
AMD APP OpenCL SDK (OpenCL header, libraries, and runtime)
OpenCL 1.2
I am trying to keep the source code as portable as possible.
Instead of relying on speculation, you can comment out a program's buffer copies and visualisations and leave only the kernel executions intact. Then put it in a tight loop and watch the heat rise. If it heats up like FurMark, then it is using the cores. If it is not heating up, you can disable the serial operations in the kernels too (gid == 0), then try again. For example, a simple n-body simulator pushes a well-cooled HD 7000 series GPU to over 70°C in minutes, and to 90°C with a poor cooler. Compare it against a known benchmark's temperature limits.
Something similar exists for CPUs: using float4 heats the chip more than plain floats, which shows that even the instruction type matters for keeping all the ALUs busy (let alone all the threads).
If the GPU has a really good cooler, you can watch its Vdroop instead: more load means more voltage drop; more active cores mean more drop, and more load per core also means more drop.
Whatever you do, it is up to the compiler's and the hardware's abilities, and you don't have explicit control over the ALUs, because OpenCL hides the hardware complexity from the developer.
Using MSI Afterburner or similar software is not useful, because it shows 100% usage even when you use 1% of the card's true potential.
Simply look at the temperature difference of the computer case between its starting state and its equilibrium state. If the delta-T is around 50 with OpenCL and 5 without, OpenCL is parallelising the work, though you can't know by how much.

How to optimize large data manipulation in parallel

I'm developing a C/C++ application to manipulate large quantities of data in a generic way (aggregation/selection/transformation).
I'm using an AMD Phenom II X4 965 Black Edition, so I have a decent amount of cache at the different levels.
I've developed both single-threaded (ST) and multi-threaded (MT) versions of the functions that perform all the individual operations and, not surprisingly, in the best case the MT versions are 2x faster than the ST ones, even when using 4 cores.
Given that I'm a fan of using 100% of the available resources, I was annoyed that it was just 2x; I'd want 4x.
For this reason I've already spent quite a considerable amount of time with -pg and valgrind, using the cache simulator and call graph. The program works as expected, the cores share the input process data (i.e. the operations to apply to the data), and cache misses are reported (as expected) when the different threads load the data to be processed (millions of entities, or rows, if that gives you an idea of what I'm trying to do :-) ).
I've also tried different compilers, g++ and clang++, both with -O3, and the performance is identical.
My conclusion is that, due to the large amount of data (GBs) to process, and given that the data eventually has to be loaded into the CPU, what remains is real wait time.
Can I further improve my software? Have I hit a limit?
I'm using C/C++ on Linux x86-64, Ubuntu 11.10.
I'm all ears! :-)
What kind of application is it? Could you show us some code?
As I commented, you might have reached some hardware limit like RAM bandwidth. If you did, no software trick could improve it.
You might investigate using MPI, OpenMP, or OpenCL (on GPUs), but without an idea of your application we cannot help.
If you are compiling with GCC and want to help the processor's cache prefetching, consider using __builtin_prefetch, with care and parsimony (using it too much or badly can decrease performance).
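A hedged sketch of what that looks like; the prefetch distance of 16 elements is an arbitrary illustration that must be tuned and measured, and on a simple streaming loop like this the hardware prefetcher usually does the job already, so the example mainly shows the syntax:

```cpp
// prefetch_sum.cpp -- GCC/Clang __builtin_prefetch in a streaming loop.
// Build: g++ -O3 prefetch_sum.cpp
#include <vector>
#include <cstdio>

double sum(const std::vector<double> &v) {
    const std::size_t n = v.size();
    const std::size_t dist = 16;                     // elements ahead to prefetch (tune and measure!)
    double s = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        if (i + dist < n)
            __builtin_prefetch(&v[i + dist], 0, 0);  // read access, low temporal locality
        s += v[i];
    }
    return s;
}

int main() {
    std::vector<double> v(1 << 24, 1.0);
    std::printf("%f\n", sum(v));
    return 0;
}
```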

AMD multi-core programming

I want to start writing applications (C++) that will utilize the additional cores to execute portions of the code that need to perform lots of calculations and whose computations are independent of each other. I have the following processor: x64 Family 15 Model 104 Stepping 2, AuthenticAMD, ~1900 MHz, running Windows Vista Home Premium 32-bit and openSUSE 11.0 64-bit.
On Intel platforms I've used the following APIs: Intel TBB and OpenMP. Do they work on AMD, and does AMD have similar APIs? What has been your experience?
OpenMP and TBB are both available for AMD as well; it is also a compiler question.
E.g. see linux TBB on AMD.
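For example, a minimal OpenMP sketch that compiles and runs the same way on AMD and Intel machines (with GCC, build with -fopenmp):

```cpp
// omp_sum.cpp -- independent per-element work spread over all cores with OpenMP.
// Build: g++ -O2 -fopenmp omp_sum.cpp
#include <omp.h>
#include <vector>
#include <cstdio>

int main() {
    const int n = 1 << 24;
    std::vector<double> data(n);

    // Each iteration is independent, so the loop can be split across threads.
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        data[i] = static_cast<double>(i) * 0.5;

    double total = 0.0;
    #pragma omp parallel for reduction(+:total)
    for (int i = 0; i < n; ++i)
        total += data[i];

    std::printf("threads available: %d, total = %f\n", omp_get_max_threads(), total);
    return 0;
}
```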
I think the latest development on this front is to use the graphics card via CUDA or similar APIs, but this depends on the nature of your calculations. If they fit, the GPU is faster than the CPU anyway.