TBB Parallel Pipeline: Filter Timing Inconsistent

I'm programming an application that processes a video stream using a `tbb::parallel_pipeline`. My first filter contains two important operations, one of which must occur immediately after the other.
My tests show that the delay between the two operations is anywhere from 3 to 20 milliseconds when I set max_number_of_live_tokens to 6 (the number of filters I have), but is consistently 3 to 4 milliseconds when max_number_of_live_tokens is 1. The jitter in the first case is unacceptable for my application, but I need to allow multiple tokens to be in flight simultaneously to exploit parallelism.
Here is my pipeline setup:
tbb::parallel_pipeline(6, //max_number_of_live_tokens
// 1st Filter
tbb::make_filter< void, shared_ptr<PipelinePacket_t> >(tbb::filter::serial_in_order,
[&](tbb::flow_control& fc)->shared_ptr<PipelinePacket_t>
{
shared_ptr<PipelinePacket_t> pPacket = grabFrame();
return pPacket;
}
)
&
... // 5 other filters that process the image - all 'serial_in_order'
);
And here is my grabFrame() function:
shared_ptr<VisionPipeline::PipelinePacket_t> VisionPipeline::grabFrame() {
shared_ptr<PipelinePacket_t> pPacket(new PipelinePacket_t);
m_cap >> pPacket->frame; // Operation A (use opencv api to capture frame)
pPacket->motion.gyroDeg = m_imu.getGyroZDeg(); // Operation B (read a gyro value)
return pPacket;
}
I need operations A and B to happen as close as possible to each other (so that the gyro value reflects its value at the time the frame was captured).
My guess is that the jitter that occurs when multiple tokens are in flight simultaneously is caused by tasks from other filters running on the same thread as the first filter and interrupting it while grabFrame() is executing. I've dug through some TBB documentation but can't find anything on how parallel_pipeline breaks up filters into tasks, so it is unclear to me whether TBB is somehow breaking up grabFrame() into multiple TBB tasks.
Is this assumption correct? If so, how can I tell tbb not to interrupt the first filter between operations A and B with other tasks?

OpenCV uses TBB internally for various operations. So if this is actually related to TBB, it's not that you were interrupted between A and B, but rather that OpenCV itself is competing for priority with the remainder of the filter chain. That is unlikely, though.
"so it is unclear to me if TBB is somehow breaking up grabFrame() into multiple TBB tasks or not."
That never happens. Unless there are parts in there that explicitly dispatch work via TBB, it has no effect whatsoever. TBB does not magically split your functions into tasks.
But that may not even be your only issue. If your filters happen to be heavy on memory bandwidth, it's likely the case that you are slowing down the actual capture process significantly just by concurrent execution of the image processing.
It looks like you are running the full image through 5 filters in a row, is that correct? Full resolution, not tiled? If so, most of these filters are likely constrained not by the ALU but by memory bandwidth, as you are not staying within the CPU cache either.
If you wish to go parallel, you must get rid of the write-backs to main memory in between the filter stages. The only way to do that is either to start tiling the images in the input filter of the filter chain, or to write a custom all-in-one filter kernel. If you have filters in that chain with spatial dependencies, that's obviously not as easy as I make it sound; in that case you have to include some overlap in the upper stages.
max_number_of_live_tokens then actually has a real meaning. It's the number of "tiles" in flight. Which is not primarily intended to limit each filter stage to only one concurrent execution (that's not happening anyway), but rather to keep the maximum working set size under control.
E.g. if you know that each of your tiles is now 128kB in size, you know that there are 2 copies involved in each filter (source and destination), and you know you have a 2MB L3 cache, then you would know that you can afford to have 8 tokens in flight without spilling to main memory. If you also happen to have (at least) 8 CPU cores, that yields ideal throughput, but even if you don't, at least you are not risking to become bottle-necked by exceeding cache size. Of course you can afford some spilling to main memory (past what you calculated to be safe), but then you have to perform in-depth profiling of your system to see if you are getting constrained.
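The working-set arithmetic above can be sketched as a small helper. This is only an illustration: `max_live_tokens` and its parameter names are hypothetical, and the cache and tile sizes are the example figures from the paragraph above, not measured values.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: how many tokens (tiles) fit in the cache at once,
// assuming each in-flight token holds `copies_per_token` buffers of
// `tile_bytes` each (e.g. one source and one destination).
inline std::size_t max_live_tokens(std::size_t cache_bytes,
                                   std::size_t tile_bytes,
                                   std::size_t copies_per_token) {
    assert(tile_bytes > 0 && copies_per_token > 0);
    return cache_bytes / (tile_bytes * copies_per_token);
}
```

With a 2 MB cache, 128 kB tiles and 2 copies per token, `max_live_tokens(2 * 1024 * 1024, 128 * 1024, 2)` evaluates to 8, which would then be the first argument passed to `tbb::parallel_pipeline`.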

Related

why does having more than one thread(parallel processing) in some specific cases degrade performance?

I noticed that running some code with more than one thread is much slower than running it with a single thread, and I have been pulling my hair out trying to understand why. Can anyone help?
Code explanation:
I sometimes have a very large array that I need to process in parallel as an optimization. Each "part" of a row is looped over and processed in its own thread. I've noticed that if I have only one "part", i.e. the whole array with a single worker thread running through it, it is noticeably faster than if I divide the array and process it as separate sub-arrays with different threads.
bool m_generate_row_worker(ull t_row_start,ull t_row_end)
{
for(;t_row_start<t_row_end;t_row_start++)
{
m_current_row[t_row_start]=m_singularity_checker(m_previous_row[t_row_start],m_shared_random_row[t_row_start]);
}
return true;
}
...
//code
...
for(unsigned short thread_indx=0;thread_indx<noThreads-1;thread_indx++)
{
m_threads_array[thread_indx]=std::thread(
m_generate_row_worker,this,
thread_indx*(m_parts_per_thread),(thread_indx+1)*(m_parts_per_thread));
}
m_threads_array[noThreads-1]=std::thread(m_generate_row_worker,this,
(noThreads-1)*(m_parts_per_thread),std::max((noThreads)*(m_parts_per_thread),m_blocks_per_row));
//join
for(unsigned short thread_indx=0;thread_indx<noThreads;thread_indx++)
{
m_threads_array[thread_indx].join();
}
//EDIT
inline ull m_singularity_checker(ull t_to_be_ckecked_with,ull
t_to_be_ckecked)
{
return (t_to_be_ckecked & (t_to_be_ckecked_with<<1)
& (t_to_be_ckecked_with>>1) ) | (t_to_be_ckecked_with &
t_to_be_ckecked);
}

why does having more than one thread(parallel processing) in some specific cases degrade performance?
Because thread creation has overhead. If the task to be performed has only small computational cost, then the cost of creating multiple threads is more than the time saved by parallelism. This is especially the case when creating significantly more threads than there are CPU cores.
Because many algorithms do not easily divide into independent sub-tasks. Dependencies on other threads requires synchronisation, which has overhead that can in some cases be more than the time saved by parallelism.
Because in poorly designed programs, synchronization can cause all tasks to be processed sequentially even if they are in separate threads.
Because (depending on CPU architecture) sometimes otherwise correctly implemented and seemingly independent tasks have an effective dependency because they operate on the same area of memory. More specifically, when a thread writes into a piece of memory, all threads operating on the same cache line must synchronise (the CPU does this for you automatically) to remain consistent. The cost of the resulting cache misses is often much higher than the time saved by parallelism. This problem is called "false sharing".
Because sometimes introduction of multi threading makes the program more complex, which makes it more difficult for the compiler / optimiser to make use of instruction level parallelism.
...
In conclusion: Threads are not a silver bullet that automatically multiplies the performance of your program.
Regarding your program, we cannot count out any of the above potential issues given the excerpt that you have shown.
Some tips on avoiding or finding above issues:
Don't create more threads than you have cores, discounting the number of threads that are expected to be blocking (waiting for input, disk, etc).
Only use multi-threading with problems that are computationally expensive, (or to do work while a thread is blocking, but this may be more efficiently solved using asynchronous I/O and coroutines).
Don't do (or do as little as possible) I/O from more than one thread into a single device (disk, NIC, virtual terminal, ...) unless it is specially designed to handle it.
Minimise the number of dependencies between threads. Consider all access to global things that may cause synchronisation, and avoid them. For example, avoid memory allocation. Keep in mind that things like operations on standard containers do memory allocation.
Keep the memory touched by distinct threads far from each other (not adjacent small elements of array). If processing an array, divide it in consecutive blocks, rather than striping one element every (number of threads)th element. In some extreme cases, extra copying into thread specific data structures, and then joining in the end may be efficient.
If you've done all you can, and multi threading measures slower, consider whether perhaps it is not a good solution for your problem.
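As a rough sketch of the "consecutive blocks" tip above, the helper below splits an index range into contiguous chunks, one per worker. The name `block_ranges` is hypothetical and not taken from the question's code; it only illustrates the partitioning, not the threading itself.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Split [0, n) into at most `parts` contiguous, non-overlapping half-open
// ranges [begin, end). When n is not a multiple of `parts`, the last range
// is simply shorter, so no element is skipped or processed twice.
inline std::vector<std::pair<std::size_t, std::size_t>>
block_ranges(std::size_t n, std::size_t parts) {
    assert(parts > 0);
    std::vector<std::pair<std::size_t, std::size_t>> ranges;
    const std::size_t block = (n + parts - 1) / parts;  // ceiling division
    for (std::size_t begin = 0; begin < n; begin += block) {
        ranges.emplace_back(begin, std::min(n, begin + block));
    }
    return ranges;
}
```

Each returned range would be handed to one worker thread. A reasonable starting value for `parts` is `std::thread::hardware_concurrency()`, falling back to 1 when that call returns 0.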

Using threads does not always mean that you will get more work done. For example, using 2 threads does not mean you will get a task done in half the time. There is an overhead to setting up the threads and, depending on the number of cores, the OS, etc., to the context switching that occurs between threads (saving the thread stack/registers and loading the next one; it all adds up). At some point adding more threads will start to slow your program down, since more time will be spent switching between threads and setting them up or tearing them down than doing actual work. So you may be a victim of this.
If you have 100 very small items (like 1 instruction) of work to do, then 100 threads will be guaranteed to be slower since you now have ("many instructions" + 1) x 100 of work to do. Where the "many instructions" are the work of setting up the threads and clearing them up at the end - and switching between them.
So, you may want to start to profile this for yourself.. How much work is done processing each row and how many threads in total are you setting up?
One very crude, but quick and simple, way to start measuring is to take the time elapsed to process one row in isolation (e.g. use std::chrono functions to record the time at the start of processing one row and again at the end to get the total time spent). Then maybe do the same test over the entire table to get an idea of the total time.
If you find that an individual row takes very little time, then you may not be getting much benefit from the threads... You may be better off splitting the table into as many chunks of work as your CPU has cores, then changing the number of threads (+/-) to find the sweet spot. Just making threads based on the number of rows is a poor choice; you really want to design it to max out each core (for example).
So if you had 4 cores, maybe start by splitting the work into 4 threads. Then test it with 8; if it's better try 16, if it's worse try 12, etc.
Also you might get different results on different PCs...
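The crude timing approach described above could look like the following minimal sketch. The helper name `elapsed_ms` is hypothetical; it simply wraps whatever piece of work you want to measure (one row, or the whole table).

```cpp
#include <chrono>
#include <utility>

// Run `work` once and return the elapsed wall-clock time in milliseconds.
// steady_clock is used because it is monotonic and never jumps backwards.
template <typename Work>
double elapsed_ms(Work&& work) {
    const auto start = std::chrono::steady_clock::now();
    std::forward<Work>(work)();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Calling it once around a single-row computation and once around the full multi-threaded run gives two numbers to compare. A single measurement is noisy, so repeating the call and looking at the minimum or median is usually more informative.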

Intel Tbb overhead issue

I'm using Intel TBB to parallelize some parts of an algorithm that processes images. Although the processing of each pixel is data dependent, there are some cases in which 2 consecutive pixels can be processed in parallel, as shown below.
ProcessImage(image)
    for each row in image // Create and wait root task for each line here using allocate_root()
        ProcessRow(row)
            for each 2 pixel
                if(parallel())
                    ProcessPixel(A) and ProcessPixel(B) in parallel // For testing, create and process 2 tbb::empty_task() here as child tasks
                else
                    ProcessPixel(A)
                    ProcessPixel(B)
However, the overhead becomes noticeable because this processing is very fast. For each input image (512x512), the processing takes about 5-6 ms.
When I experimentally used Intel TBB as described in the comments above, the processing took more than 25 ms.
So is there a better way to use Intel TBB without this overhead, or another more efficient way to improve the performance of a simple and fast processing program like this?

TBB does not add such a large (~20 ms) overhead for the invocation of a parallel algorithm. My guess (since no specifics are provided) is that it is related to one of the following:
If you measure only the first invocation, it includes the overhead of creating worker threads. Note also that TBB does not have barriers like OpenMP, so one call to parallel_for might not be enough to create all the threads.
The same situation occurs after worker threads go to sleep because there is no parallel work for them. The wake-up overhead is orders of magnitude lower than that of thread creation, but it can still affect measurements and lead to wrong conclusions.
The TBB scheduler can steal a task from the outer level while waiting at the nested level (a blocking call), so the measurement makes the nested part look slow while the thread is actually busy with extra work.
There is contention when processing (A) and (B) in parallel, caused by either explicit (e.g. a mutex) or implicit (e.g. false sharing) reasons. In any case, this is not TBB-specific.
Thus, the advice for performance measurements with TBB is to consider only the total time for a sequence of computations long enough to hide the initialization overhead.
And of course, as was advised, parallelize first at the outer level. TBB provides enough different patterns for that, including tbb::parallel_pipeline and tbb::flow::graph.

Cuda Stream Processing for multiple kernels Disambiguation

Hi, I have a few questions regarding CUDA stream processing for multiple kernels.
Assume s streams and a kernel on a compute capability 3.5 Kepler device, where s <= 32.
The kernel uses a dev_input array of size n and a dev_output array of size s*n.
The kernel reads data from the input array, stores its value in a register, manipulates it and writes its result back to dev_output at the position s*n + tid.
We aim to run the same kernel s times, using one of the s streams each time, similar to the simpleHyperQ example. Can you comment on whether and how any of the following affects concurrency, please?
1. dev_input and dev_output are not pinned;
2. dev_input as it is vs dev_input size s*n, where each kernel reads unique data (no read conflicts)
3. kernels read data from constant memory
4. 10kb of shared memory are allocated per block.
5. kernel uses 60 registers
Any good comments will be appreciated...!!!
cheers,
Thanasio
Robert,
thanks a lot for your detailed answer. It has been very helpful. I edited 4, it is 10kb per block. So in my situation, I launch grids of 61 blocks and 256 threads. The kernels are rather computationally bound. I launch 8 streams of the same kernel and profile them. I see a very good overlap between the first two, and then it gets worse and worse. The kernel execution time is around 6ms. After the first two streams execute almost perfectly concurrently, the rest have a 3ms distance between them. Regarding 5, I use a K20, which allows up to 255 registers per thread, so I would not expect drawbacks from there. I really cannot understand why I do not achieve concurrency equivalent to what is specified for gk110s..
Please take a look at the following link. There is an image called kF.png. It shows the profiler output for the streams.
https://devtalk.nvidia.com/default/topic/531740/cuda-programming-and-performance/concurrent-streams-and-hyperq-for-k20/

Concurrency amongst kernels depends upon a number of factors, but one that many people overlook is simply the size of the kernel (i.e. number of blocks in the grid.) Kernels that are of a size that can effectively utilize the GPU by themselves will not generally run concurrently to a large degree, and there would be little throughput advantage even if they did. The work distributor inside the GPU will generally begin distributing blocks as soon as a kernel is launched, so if one kernel is launched before another, and both have a large number of blocks, then the first kernel will generally occupy the GPU until it is nearly complete, at which point blocks of the second kernel will then get scheduled and executed, perhaps with a small amount of "concurrent overlap".
The main point is that kernels that have enough blocks to "fill up the GPU" will prevent other kernels from actually executing, and apart from scheduling, this isn't any different on a compute 3.5 device. In addition, rather than just specifying a few parameters for the kernel as a whole, also specifying launch parameters and statistics (such as register usage, shared mem usage, etc.) at the block level are helpful for providing crisp answers. The benefits of the compute 3.5 architecture in this area will still mainly come from "small" kernels of "few" blocks, attempting to execute together. Compute 3.5 has some advantages there.
You should also review the answer to this question.
1. When global memory used by the kernel is not pinned, it affects the speed of data transfer, and also the ability to overlap copy and compute but does not affect the ability of two kernels to execute concurrently. Nevertheless, the limitation on copy and compute overlap may skew the behavior of your application.
2. There shouldn't be "read conflicts", I'm not sure what you mean by that. Two independent threads/blocks/grids are allowed to read the same location in global memory. Generally this will get sorted out at the L2 cache level. As long as we are talking about just reads there should be no conflict, and no particular effect on concurrency.
3. Constant memory is a limited resource, shared amongst all kernels executing on the device (try running deviceQuery). If you have not exceeded the total device limit, then the only issue will be one of utilization of the constant cache, and things like cache thrashing. Apart from this secondary relationship, there is no direct effect on concurrency.
4. It would be more instructive to identify the amount of shared memory per block rather than per kernel. This will directly affect how many blocks can be scheduled on a SM. But answering this question would be much crisper also if you specified the launch configuration of each kernel, as well as the relative timing of the launch invocations. If shared memory happened to be the limiting factor in scheduling, then you can divide the total available shared memory per SM by the amount used by each kernel, to get an idea of the possible concurrency based on this. My own opinion is that number of blocks in each grid is likely to be a bigger issue, unless you have kernels that use 10k per grid but only have a few blocks in the whole grid.
5. My comments here would be nearly the same as my response to 4. Take a look at deviceQuery for your device, and if registers became a limiting factor in scheduling blocks on each SM, then you could divide available registers per SM by the register usage per kernel (again, it makes a lot more sense to talk about register usage per block and the number of blocks in the kernel) to discover what the limit might be.
Again, if you have reasonable sized kernels (hundreds or thousands of blocks, or more) then the scheduling of blocks by the work distributor is most likely going to be the dominant factor in the amount of concurrency between kernels.
EDIT: in response to new information posted in the question. I've looked at the kF.png
First let's analyze from a blocks per SM perspective. CC 3.5 allows 16 "open" or currently scheduled blocks per SM. If you are launching 2 kernels of 61 blocks each, that may well be enough to fill the "ready-to-go" queue on the CC 3.5 device. Stated another way, the GPU can handle 2 of these kernels at a time. As the blocks of one of those kernels "drains" then another kernel is scheduled by the work distributor. The blocks of the first kernel "drain" sufficiently in about half the total time, so that the next kernel gets scheduled about halfway through the completion of the first 2 kernels, so at any given point (draw a vertical line on the timeline) you have either 2 or 3 kernels executing simultaneously. (The 3rd kernel launched overlaps the first 2 by about 50% according to the graph, I don't agree with your statement that there is a 3ms distance between each successive kernel launch). If we say that at peak we have 3 kernels scheduled (there are plenty of vertical lines that will intersect 3 kernel timelines) and each kernel has ~60 blocks, then that is about 180 blocks. Your K20 has 13 SMs and each SM can have at most 16 blocks scheduled on it. This means at peak you have about 180 blocks scheduled (perhaps) vs. a theoretical peak of 16*13 = 208. So you're pretty close to max here, and there's not much more that you could possibly get. But maybe you think you're only getting 120/208, I don't know.
Now let's take a look from a shared memory perspective. A key question is what is the setting of your L1/shared split? I believe it defaults to 48KB of shared memory per SM, but if you've changed this setting that will be pretty important. Regardless, according to your statement each block scheduled will use 10KB of shared memory. This means we would max out around 4 blocks scheduled per SM, or 4*13 total blocks = 52 blocks max that can be scheduled at any given time. You're clearly exceeding this number, so probably I don't have enough information about the shared memory usage by your kernels. If you're really using 10kb/block, this would more or less preclude you from having more than one kernel's worth of threadblocks executing at a time. There could still be some overlap, and I believe this is likely to be the actual limiting factor in your application. The first kernel of 60 blocks gets scheduled. After a few blocks drain (or perhaps because the 2 kernels were launched close enough together) the second kernel begins to get scheduled, so nearly simultaneously. Then we have to wait a while for about a kernel's worth of blocks to drain before the 3rd kernel can get scheduled, this may well be at the 50% point as indicated in the timeline.
Anyway I think the analyses 1 and 2 above clearly suggest you're getting most of the capability out of the device, based on the limitations inherent in your kernel structure. (We could do a similar analysis based on registers to discover if that is a significant limiting factor.) Regarding this statement: "I really cannot understand why i do not achieve concurrency equivalent to what is specified for gk110s.." I hope you see that the concurrency spec (e.g. 32 kernels) is a maximum spec, and in most cases you are going to run into some other kind of machine limit before you hit the limit on the maximum number of kernels that can execute simultaneously.
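The per-SM arithmetic in the analyses above can be written down as a short host-side sketch. The struct and function names are hypothetical. The 16-block and 48 KB limits come from the text; the figure of 65536 registers per SM is an assumption for a GK110 device that should be checked against deviceQuery. Real hardware allocates resources in coarser units, so the result is an estimate only.

```cpp
#include <algorithm>
#include <cassert>

struct SmLimits {
    int max_blocks;       // resident blocks per SM (16 on CC 3.5)
    int shared_bytes;     // shared memory available per SM
    int registers;        // 32-bit registers per SM (assumed value)
};

struct BlockUsage {
    int threads;          // threads per block
    int shared_bytes;     // shared memory per block (must be non-zero here)
    int regs_per_thread;  // registers per thread
};

// Upper bound on blocks resident on one SM: the tightest of the three limits.
inline int resident_blocks_per_sm(const SmLimits& sm, const BlockUsage& b) {
    assert(b.threads > 0 && b.shared_bytes > 0 && b.regs_per_thread > 0);
    const int by_shared = sm.shared_bytes / b.shared_bytes;
    const int by_regs = sm.registers / (b.threads * b.regs_per_thread);
    return std::min({sm.max_blocks, by_shared, by_regs});
}
```

For the configuration in the question (256 threads, 10 KB shared memory and 60 registers per thread), both the shared memory and the register limits give 4 blocks per SM, i.e. 4 * 13 = 52 blocks on a K20, well below the 16 * 13 = 208 allowed by the block limit alone.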
EDIT: regarding documentation and resources, the answer I linked to above from Greg Smith provides some resource links. Here are a few more:
The CUDA C Programming Guide has a section on Asynchronous Concurrent Execution.
GPU Concurrency and Streams presentation by Dr. Steve Rennich at NVIDIA is on the NVIDIA webinar page

My experience with HyperQ so far is a 2-3x (at most 3.5x) parallelization of my kernels, as the kernels are usually larger and perform somewhat more complex calculations. With small kernels it's a different story, but usually the kernels are more complicated.
This is also addressed by Nvidia in their CUDA 5.0 documentation, which states that more complex kernels reduce the amount of parallelization.
But still, GK110 has a great advantage just allowing this.

OpenMP and OOP (Molecular Dynamics Simulation)

I’m conducting a molecular dynamics simulation, and I’ve been struggling for quite a while to implement it in parallel, and although I succeeded in fully loading my 4-thread processor, the computation time in parallel is greater than the computation time in serial mode.
Studying at which point of time each thread starts and finishes its loop iteration, I’ve noticed a pattern: it’s as if different threads are waiting for each other.
It was then that I turned my attention to the structure of my program. I have a class, an instance of which represents my system of particles, containing all the information about particles and some functions that use this information. I also have a class, an instance of which represents my interatomic potential, containing the parameters of the potential function along with some functions (one of which calculates the force between two given particles).
And so in my program there exist instances of two different classes, and they interact with each other: some functions of one class take references to instances of another class.
And the block I’m trying to implement in parallel looks like this:
void Run_simulation(Class_system &system, Class_potential &potential, some other arguments){
#pragma omp parallel for
for(…)
}
for(...) is the actual computation, using data from the system instance of the Class_system class and some functions from the potential instance of the Class_potential class.
Am I right that it’s this structure that’s the source of my troubles?
Could you suggest what has to be done in this case? Must I rewrite my program in a completely different manner? Should I use a different tool to implement my program in parallel?

Without further details on your simulation type I can only speculate, so here are my speculations.
Did you look into the issue of load balancing? I guess the loop distributes the particles among threads but if you have some kind of a restricted range potential, then the computational time might differ from particle to particle in the different regions of the simulation volume, depending on the spatial density. This is a very common problem in molecular dynamics and one that is very hard to solve properly in distributed memory (MPI in most cases) codes. Fortunately with OpenMP you get direct access to all particles at each computing element and so the load balancing is much easier to achieve. It is not only easier, but it is also built-in, so to speak - simply change the scheduling of the for directive with the schedule(dynamic,chunk) clause, where chunk is a small number whose optimal value might vary from simulation to simulation. You might make chunk part of the input data to the program or you might instead write schedule(runtime) and then play with different scheduling classes by setting the OMP_SCHEDULE environment variable to values like "static", "dynamic,1", "dynamic,10", "guided", etc.
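A minimal sketch of the scheduling suggestion above, assuming a loop whose cost per particle varies. The function `count_interactions` and its neighbour-count input are hypothetical stand-ins for the real force loop. Compile with OpenMP enabled (for example `-fopenmp` with GCC); without it the pragma is ignored and the loop runs serially with the same result.

```cpp
#include <vector>

// Hypothetical stand-in for a short-range force loop: the work per particle
// is proportional to its neighbour count, which varies with local density.
long long count_interactions(const std::vector<int>& neighbours) {
    long long total = 0;
    const long n = static_cast<long>(neighbours.size());
    // schedule(runtime) defers the choice to the OMP_SCHEDULE environment
    // variable, e.g. "static", "dynamic,10" or "guided".
    #pragma omp parallel for schedule(runtime) reduction(+ : total)
    for (long i = 0; i < n; ++i) {
        for (int k = 0; k < neighbours[i]; ++k) {
            total += 1;  // placeholder for one pair-force evaluation
        }
    }
    return total;
}
```

Running the same binary with different `OMP_SCHEDULE` values then shows which scheduling class suits a given particle distribution, without recompiling.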
Other possible sources of performance degradation are false sharing and true sharing. False sharing occurs when your data structure is not suitable for concurrent modification. For example, if you keep 3D positional and velocity information for each particle (let's say you use a velocity Verlet integrator), given IEEE 754 double precision, each coordinate/velocity triplet takes 24 bytes. This means that a single cache line of 64 bytes accommodates 2 complete triplets and 2/3 of another one. The consequence of this is that no matter how you distribute the particles among the threads, there would always be at least two threads that would have to share a cache line. Suppose that those threads run on different physical cores. If one thread writes to its copy of the cache line (for example, it updates the position of a particle), the cache coherency protocol gets involved and invalidates the cache line in the other core, which then has to reread it from a slower cache or even from main memory. When the second thread updates its particle, this invalidates the cache line in the first core. The solution to this problem comes with proper padding and a proper choice of chunk size so that no two threads share a single cache line. For example, if you add an artificial 4th dimension (you can use it to store the potential energy of the particle in the 4th element of the position vector and the kinetic energy in the 4th element of the velocity vector), then each position/velocity quadruplet takes 32 bytes and the information for exactly two particles fits in a single cache line. If you then distribute an even number of particles per thread, you automatically get rid of possible false sharing.
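The padding idea can be sketched as follows. The names `Vec4`, `kCacheLine` and `round_up_chunk` are hypothetical, and the 64-byte cache line is an assumption that holds for typical x86 processors. The guarantee also requires the array itself to start on a cache-line boundary, which this sketch does not enforce.

```cpp
#include <cstddef>

// Position or velocity padded to four doubles: x, y, z plus one spare slot
// (the text suggests storing potential or kinetic energy there).
struct Vec4 {
    double x, y, z, w;
};

constexpr std::size_t kCacheLine = 64;  // assumed cache-line size in bytes
constexpr std::size_t kPerLine = kCacheLine / sizeof(Vec4);

static_assert(sizeof(Vec4) == 32, "Vec4 must be exactly 32 bytes");
static_assert(kCacheLine % sizeof(Vec4) == 0, "Vec4 must tile a cache line");

// Round a per-thread chunk size up to a whole number of cache lines, so
// that chunk boundaries never fall in the middle of a line.
constexpr std::size_t round_up_chunk(std::size_t chunk) {
    return (chunk + kPerLine - 1) / kPerLine * kPerLine;
}
```

With two `Vec4` elements per line, `round_up_chunk(5)` gives 6, matching the "even number of particles per thread" rule from the paragraph above.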
True sharing occurs when threads concurrently access the same data structure and there is an overlap between the parts of the structure modified by the different threads. In molecular dynamics simulations this occurs very frequently, as we want to exploit Newton's third law in order to cut the computational time in half when dealing with pairwise interaction potentials. When one thread computes the force acting on particle i while enumerating its neighbours j, computing the force that j exerts on i automatically gives you the force that i exerts on j, so that contribution can be added to the total force on j. But j might belong to another thread that might be modifying it at the same time, so atomic operations have to be used for both updates (both, since another thread might update i if it happens to neighbour one or more of its own particles). Atomic updates on x86 are implemented with locked instructions. This is not as horribly slow as often presented, but it is still slower than a regular update. It also involves the same cache line invalidation effect as with false sharing. To get around this, at the expense of increased memory usage, one could use local arrays to store partial force contributions and then perform a reduction at the end. The reduction itself has to be performed either in serial or in parallel with locked instructions, so it might turn out that not only is there no gain from using this approach, but it could even be slower. Proper particle sorting and clever distribution between the processing elements, so as to minimise the interface regions, can be used to tackle this problem.
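The "local arrays plus reduction" variant described above might look like the sketch below. It is an illustration under simplifying assumptions: particles live in one dimension, `pair_force` is a hypothetical linear spring, and all pairs interact (no cutoff). The reduction here is the serial variant, guarded by a critical section.

```cpp
#include <vector>

// Hypothetical pair force in 1D: a linear spring pulling i towards j.
inline double pair_force(double xi, double xj) { return xj - xi; }

std::vector<double> compute_forces(const std::vector<double>& x) {
    const long n = static_cast<long>(x.size());
    std::vector<double> forces(x.size(), 0.0);
    #pragma omp parallel
    {
        // Private accumulator: no atomics needed while pairs are visited.
        std::vector<double> local(x.size(), 0.0);
        #pragma omp for schedule(dynamic, 16)
        for (long i = 0; i < n; ++i) {
            for (long j = i + 1; j < n; ++j) {
                const double f = pair_force(x[i], x[j]);
                local[i] += f;  // force on i due to j
                local[j] -= f;  // Newton's third law: equal and opposite
            }
        }
        // Serial reduction: one thread at a time adds its partial sums.
        #pragma omp critical
        {
            for (long k = 0; k < n; ++k) {
                forces[k] += local[k];
            }
        }
    }
    return forces;
}
```

The memory cost is one extra array per thread, and the critical section serialises the final summation, which is exactly the trade-off the paragraph above warns about.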
One more thing that I would like to touch is the memory bandwidth. Depending on your algorithm, there is a certain ratio between the number of data elements fetched and the number of floating point operations performed at each iteration of the loop. Each processor has only a limited bandwidth available for memory fetches and if it happens that your data does not quite fit in the CPU cache, then it might happen that the memory bus is unable to deliver enough data to feed so many threads executing on a single socket. Your Core i3-2370M has only 3 MiB of L3 cache so if you explicitly keep the position, velocity and force for each particle, you can only store about 43000 particles in the L3 cache and about 3600 particles in the L2 cache (or about 1800 particles per hyperthread).
The last one is hyperthreading. As High Performance Mark has already noted, hyperthreads share a great deal of core machinery. For example, there is only one AVX vector FPU engine, which is shared between both hyperthreads. If your code is not vectorised, you lose a great deal of the computing power available in your processor. If your code is vectorised, then both hyperthreads will get in each other's way as they fight for control over the AVX engine. Hyperthreading is useful only when it is able to hide memory latency by overlapping computation (in one hyperthread) with memory loads (in another hyperthread). With dense numerical codes that perform many register operations before they perform a memory load/store, hyperthreading gives no benefit whatsoever and you'd be better off running with half the number of threads and explicitly binding them to different cores so as to prevent the OS scheduler from running them as hyperthreads. The scheduler on Windows is particularly dumb in this respect, see here for an example rant. Intel's OpenMP implementation supports various binding strategies controlled via environment variables. So does GNU's OpenMP implementation. I am not aware of any way to control thread binding (a.k.a. affinity masks) in Microsoft's OpenMP implementation.

Why doesn't this C++ code reach 100% usage of one core?

I just ran some benchmarks for this great question/answer: Why is my program slow when looping over exactly 8192 elements?
I want to benchmark on one core, so the program is single-threaded. But it doesn't reach 100% usage of one core; it uses 60% at most. So my tests are not accurate.
I'm using Qt Creator, compiling with MinGW in release mode.
Are there any parameters to set for better performance? Is it normal that I can't use the full CPU power? Is it Qt related? Are there interruptions or something else preventing the code from running at 100%?
Here is the main loop
// horizontal sums for first two lines
for(i=1;i<SIZE*2;i++){
hsumPointer[i]=imgPointer[i-1]+imgPointer[i]+imgPointer[i+1];
}
// rest of the computation
for(;i<totalSize;i++){
// compute horizontal sum for next line
hsumPointer[i]=imgPointer[i-1]+imgPointer[i]+imgPointer[i+1];
// final result
resPointer[i-SIZE]=(hsumPointer[i-SIZE-SIZE]+hsumPointer[i-SIZE]+hsumPointer[i])/9;
}
This is run 10 times on an array of SIZE*SIZE float with SIZE=8193, the array is on the heap.

There could be several reasons why Task Manager isn't showing 100% CPU usage on 1 core:
You have a multiprocessor system and the load is getting spread across multiple CPUs (most OSes will do this unless you specify a more restrictive CPU affinity);
The run isn't long enough to span a complete Task Manager sampling period;
You have run out of RAM and are swapping heavily, meaning lots of time is spent waiting for disk I/O when reading/writing memory.
Or it could be a combination of all three.
Also Let_Me_Be's comment on your question is right -- nothing here is Qt's fault, since no Qt functions are being called (assuming that the objects being read and written to are just simple numeric data types, not fancy C++ objects with overloaded operator=() or something). The only activities taking place in this region of the code are purely CPU-based (well, the CPU will spend some time waiting for data to be sent to/from RAM, but that is counted as CPU-in-use time), so you would expect to see 100% CPU utilisation except under the conditions given above.
