Parallel program: how to find the bottleneck (CPU-bound threads) - C++

I have written a parallel program using OpenMP. It uses two threads because my laptop is dual-core, and the threads do a lot of matrix operations, so they are CPU-bound. There is no data sharing among the threads. A single instance of the program runs quite fast. But when I run multiple instances of the same program simultaneously, the performance degrades. Here is a plot of the timings:
The running time for a single instance (two threads) is 0.78 seconds. The running time for two instances (four threads in total) is 2.06 seconds, which is more than double 0.78. After that, the running time increases in proportion to the number of instances (and hence the number of threads).
Here is the timing profile of one of the instances when multiple were run in parallel:
Can someone offer insights into what could be going on? The profile shows that 50% of the time is being consumed by OpenMP. What does that mean?

Similar to what @Bort said, you made the application multithreaded (two threads) because you have two cores.
This means that when only one instance of your program is running, it (ideally) gets the whole CPU to itself.
However, if two instances of the application are running, there are no spare resources left, so each instance takes roughly twice as long. The same applies for more instances.
You cannot fix this without also increasing the number of cores available to each instance (i.e. keeping it at two cores per instance, rather than a shrinking share).
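If several instances really do have to run side by side, a rough mitigation is to size each instance's OpenMP team from the core count and the planned instance count instead of hard-coding two threads. This is only a sketch under that assumption; the INSTANCES constant is invented for illustration and is not part of the original program:

#include <omp.h>
#include <algorithm>
#include <cstdio>

int main() {
    const int INSTANCES = 2;                     // assumed number of copies launched together
    int cores = omp_get_num_procs();             // logical processors visible to OpenMP
    int team  = std::max(1, cores / INSTANCES);  // keep the total thread count <= core count
    omp_set_num_threads(team);

    #pragma omp parallel
    {
        // ... matrix work goes here; each instance now uses only its share of the cores ...
    }
    std::printf("instance ran with %d thread(s)\n", team);
    return 0;
}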

Related

Performance of threads and processes in Linux

I have two scenarios in Linux that I've been working on for some time on the same machine. The machine has two Xeon processors, each with 8 cores and 16 hardware threads.
I have one code in C++ that is parallelized with OpenMP. In this scenario, if I use all threads (32 in total according to the Linux kernel), do I get any penalty from contention between the threads? I mean, is setting 32 threads the optimal configuration for this scenario?
I run a given number of processes (all single-threaded) using the same binary. Basically I have a script that spawns the same binary with different input files. In this scenario, what is the best way to launch these processes without exhausting the machine? I think that if I run 32 processes at the same time I will harm the performance of the machine.
The optimal count will generally be somewhere between 16 and 32 for CPU-bound tasks (hyperthreaded logical cores compete for the same execution resources); for memory-bound or even I/O-bound tasks it can be even lower.
Still, in most cases using as many threads as cores can be a good starting point.
Why should it be harmful? In Linux, threads are just processes that happen to share the virtual address space (and most other OS resources). If you have enough RAM to keep them running without paging, and each process is single-threaded, 32 processes are as fine as 32 threads would be.
Notice that the situation would be pretty much the same for an equivalent multithreaded program, as the program code is shared between the various instances of the application.
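For the first scenario, the only reliable way to find the sweet spot between 16 and 32 is to measure. Here is a minimal sketch of such a measurement; the reduction loop is only a stand-in for the real OpenMP kernel:

#include <omp.h>
#include <cstdio>

// Stand-in for the real parallelised kernel.
static void do_work() {
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (long i = 1; i <= 200000000L; ++i)
        sum += 1.0 / i;
    volatile double sink = sum;   // keep the compiler from discarding the loop
    (void)sink;
}

int main() {
    const int candidates[] = {8, 16, 24, 32};   // team sizes to compare
    for (int t : candidates) {
        omp_set_num_threads(t);
        double start = omp_get_wtime();
        do_work();
        std::printf("%2d threads: %.3f s\n", t, omp_get_wtime() - start);
    }
    return 0;
}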

Does each Map have its own thread?

Does each Map have its own thread? So, when we do splitting, should we split the task into as many Map functions as we have processors available? Or is there some other way, besides threads, to run map functions in parallel?
I assume you're speaking about the Hadoop MapReduce implementation. Also, I assume you're asking about how the work is spread across cores.
For the intro: the number of map tasks for a given job is derived from the number of input data splits. Those tasks are then scheduled to task nodes, where mappers are started, up to mapred.tasktracker.map.tasks.maximum per node. This configuration parameter may differ between nodes, for example when nodes have different computational power. (I'll add an illustration from one of my other answers on SO.)
By default, each mapper runs in its own JVM, and there can be multiple JVMs running at any given moment on a node, up to mapred.tasktracker.map.tasks.maximum. Those JVMs are either recreated for each starting map task or reused for several consecutive runs. I won't dig into the details, but this setting can also affect performance due to the tradeoff between memory fragmentation and JVM instantiation overhead.
Proceeding to your question: which cores the running JVMs occupy is controlled by the underlying OS, which balances the load and optimises the computation. You can expect different JVMs to be executed on different cores when possible, and, in the general case, performance degradation if the number of mappers exceeds the number of cores. I have seen skewed use cases where the latter is not true.
An example:
Say you have a job split into 100 map tasks, to be run on 2 task nodes with 2 CPUs each and mapred.tasktracker.map.tasks.maximum equal to 2. Then, most of the time (except while waiting for mappers to start), your 100 tasks will be executed 4 at a time, resulting on average in 50 tasks completed by each node.
And last but not least: for map tasks it is common for the bottleneck to be I/O rather than CPU. In that case, it's not uncommon to get better results with many machines that are modest on CPU than with a few CPU-heavy servers.

Why doesn't this C++ code reach 100% usage of one core?

I just ran some benchmarks for this great question/answer: Why is my program slow when looping over exactly 8192 elements?
I want to benchmark on one core, so the program is single-threaded. But it doesn't reach 100% usage of one core; it uses 60% at most, so my tests are not accurate.
I'm using Qt Creator, compiling with MinGW in release mode.
Are there any parameters to set for better performance? Is it normal that I can't use the full CPU power? Is it Qt-related? Are there interrupts or something else preventing the code from running at 100%?
Here is the main loop:
// horizontal sums for the first two lines
for (i = 1; i < SIZE * 2; i++) {
    hsumPointer[i] = imgPointer[i - 1] + imgPointer[i] + imgPointer[i + 1];
}
// rest of the computation
for (; i < totalSize; i++) {
    // compute the horizontal sum for the next line
    hsumPointer[i] = imgPointer[i - 1] + imgPointer[i] + imgPointer[i + 1];
    // final result: average of three horizontal sums, i.e. a 3x3 box filter
    resPointer[i - SIZE] = (hsumPointer[i - SIZE - SIZE] + hsumPointer[i - SIZE] + hsumPointer[i]) / 9;
}
This is run 10 times on an array of SIZE*SIZE floats with SIZE = 8193; the array is on the heap.
There could be several reasons why Task Manager isn't showing 100% CPU usage on 1 core:
You have a multiprocessor system and the load is getting spread across multiple CPUs (most OSes will do this unless you specify a more restrictive CPU affinity);
The run isn't long enough to span a complete Task Manager sampling period;
You have run out of RAM and are swapping heavily, meaning lots of time is spent waiting for disk I/O when reading/writing memory.
Or it could be a combination of all three.
Also, Let_Me_Be's comment on your question is right: nothing here is Qt's fault, since no Qt functions are being called (assuming that the objects being read and written are just simple numeric data types, not fancy C++ objects with an overloaded operator=() or something). The only activity taking place in this region of the code is purely CPU-based (well, the CPU will spend some time waiting for data to be transferred to/from RAM, but that is counted as CPU-in-use time), so you would expect to see 100% CPU utilisation except under the conditions given above.
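If the load being spread across cores is the cause, pinning the benchmark to one core makes Task Manager's per-core reading meaningful. A minimal sketch, assuming Windows (the MinGW toolchain mentioned in the question); the timed loop itself is left as a placeholder:

#include <windows.h>
#include <cstdio>

int main() {
    // Bit 0 set: allow this process to run on logical processor 0 only.
    if (!SetProcessAffinityMask(GetCurrentProcess(), 1)) {
        std::fprintf(stderr, "SetProcessAffinityMask failed\n");
    }
    // ... run the timed loop from the question here ...
    return 0;
}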

How to design threading for many short tasks

I want to use multiple threads to accelerate my program, but I'm not sure which way is optimal.
Say we have 10000 small tasks; finishing one of them takes maybe only 0.1 s. Now I have a CPU with 12 cores and I want to use 12 threads to make it faster.
So far as I know, there are two ways:
1. Task pool
There are always 12 threads running; each of them gets a new task from the task pool after it finishes its current work.
2. Separate tasks
Split the 10000 tasks into 12 parts and have each thread work on one part.
The problem is, if I use a task pool, time is wasted on locking/unlocking when multiple threads try to access the pool. But the second way is not ideal either, because some threads finish early and the total time depends on the slowest thread.
I am wondering how you deal with this kind of work, and whether there is any other, better way to do it? Thank you.
EDIT: Please note that the number 10000 is just an example; in practice there may be 1e8 or more tasks, and 0.1 s per task is also an average.
EDIT2: Thanks for all your answers :] It is good to know about the different options.
So one midway point between the two approaches is to break the work into, say, 100 batches of 100 tasks each, and let a thread pick a batch of 100 tasks at a time from the task pool.
Perhaps if you model the randomness in the execution time of a single task on a single core, and get an estimate of the mutex locking time, you might be able to find an optimal batch size.
But without too much work we at least have the following lemma:
The slowest thread can take at most 100 * 0.1 = 10 s longer than the others.
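As a rough illustration of that midway point (not code from the question), workers can claim fixed-size batches from a shared atomic counter, so the only shared operation is one fetch_add per batch instead of a lock per task; run_task below is just a stand-in for the real 0.1 s job:

#include <algorithm>
#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

constexpr std::size_t kTasks     = 10000;  // total tasks, as in the question
constexpr std::size_t kBatchSize = 100;    // tasks claimed per grab
constexpr unsigned    kThreads   = 12;     // worker threads

static void run_task(std::size_t id) {     // stand-in for the real work
    volatile double x = 0.0;
    for (int i = 0; i < 1000; ++i) x += id * 0.5;
}

int main() {
    std::atomic<std::size_t> next{0};
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < kThreads; ++t) {
        workers.emplace_back([&next] {
            for (;;) {
                std::size_t begin = next.fetch_add(kBatchSize);  // claim one batch
                if (begin >= kTasks) break;
                std::size_t end = std::min(begin + kBatchSize, kTasks);
                for (std::size_t i = begin; i < end; ++i) run_task(i);
            }
        });
    }
    for (auto &w : workers) w.join();
    std::printf("all %zu tasks done\n", kTasks);
    return 0;
}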
A task pool is always the best solution here. It's not just about optimal time; it's also about the comprehensibility of the code. You should never force your tasks to conform to the completely unrelated criterion of having the same number of subtasks as cores: your tasks have nothing to do with that (in general), and such a split doesn't scale when you change machines, etc. It also requires overhead to combine the subtasks' results into the final result, and it generally makes an easy task hard.
But you should not be worrying about the use of locks for task pools. There are lock-free queues available if you ever determine they are necessary. But determine that first. If time is your concern, use the appropriate methods of speeding up your task, and put your effort where you will get the most benefit. Profile your code. Why do your tasks take 0.1 s? Do they use an inefficient algorithm? Can loop unrolling help? If you find the hotspots in your code through profiling, you may find that locks are the least of your worries. And if you find that everything is already running as fast as possible and you still want that extra second from removing locks, search the internet with your favorite search engine for "lock-free queue" and "wait-free queue". Compare-and-swap makes atomic lists easy.
Both ways suggested in the question will perform well and similarly to each other (in simple cases with predictable and relatively long task durations). If the target system type is known and available (and if performance is really a top concern), the approach should be chosen based on prototyping and measurement.
Do not assume up front that the optimal number of threads matches the number of cores. If this is a regular server or desktop system, various system processes will kick in here and there, and you may see your 12 threads floating between processors, which hurts memory caching.
There are also crucial non-measurement factors you should check: do those small tasks require any resources to execute? Do those resources impose additional potential delays (blocking) or contention? Are there other applications competing for CPU time? Will the application need to grow to accommodate different execution environments, task types, or user interaction models?
If the answer to all is negative, here are some additional approaches that you can measure and consider.
Use only 10 or 11 threads. You will observe a small slowdown, or even a small speedup (the extra core will serve OS processes, so the thread affinity of the rest will become more stable than with 12 threads). Any concurrent interactive activity on the system will see a big boost in responsiveness.
Create exactly 12 threads, but explicitly set a different processor affinity mask for each, to impose a 1-1 mapping between threads and processors (a sketch of this appears after the list). This is good in the simplest near-academic case, where there are no resources other than CPU and shared memory involved; you will see no chronic migration of threads across processors. The drawback is an algorithm closely coupled to a particular machine; on another machine it could behave so poorly as to never finish at all (because of an unrelated real-time task that blocks one of your threads forever).
Create 12 threads and split the tasks evenly. Have each thread downgrade its own priority once it is past 40%, and again once it is past 80%, of its load. This will improve load balancing inside your process, but it will behave poorly if your application is competing with other CPU-bound processes.
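Here is a minimal sketch of the affinity option above, assuming Linux/glibc (pthread_setaffinity_np is a GNU extension); the per-thread work is left as a placeholder comment:

#include <pthread.h>
#include <sched.h>
#include <cstdio>
#include <thread>
#include <vector>

int main() {
    const unsigned kThreads = 12;
    std::vector<std::thread> workers;
    for (unsigned core = 0; core < kThreads; ++core) {
        workers.emplace_back([core] {
            cpu_set_t set;
            CPU_ZERO(&set);
            CPU_SET(core, &set);   // restrict this thread to its own core
            pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
            // ... process this thread's share of the tasks ...
            std::printf("worker pinned to core %u\n", core);
        });
    }
    for (auto &w : workers) w.join();
    return 0;
}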
100 ms/task: pile 'em on as they are; the pool overhead will be insignificant.
OTOH..
1e8 tasks @ 0.1 s/task = 10,000,000 seconds
≈ 2,777.8 hours
≈ 115.7 days
That's much more than the interval between Patch Tuesday reboots.
Even if you run this on Linux, you should batch up the output and flush it to disk in such a manner that the job is restartable.
Is there a database involved? If so, you should have told us!
Each working thread may have its own small task queue, with a capacity of no more than one or two memory pages. When the queue becomes half empty, it should signal a manager thread to populate it with more tasks. If the queue is organised in batches, the working threads do not need to enter critical sections as long as the current batch is not empty. Avoiding critical sections will give you extra cycles for the actual job. Two batches per queue are enough; in that case one batch takes one memory page, so the queue takes two.
The point of the memory pages is that a thread does not have to jump all over memory to fetch data. If all the data are in one place (one memory page), you avoid cache misses.

Is concurrent programming the same as parallel programming?

Are they both the same thing? Looking just at what concurrent or parallel means in geometry, I'd definitely say no:
In geometry, two or more lines are said to be concurrent if they intersect at a single point.
and
Two lines in a plane that do not intersect or meet are called parallel lines.
Again, in programming, do they have the same meaning? If yes...Why?
Thanks
I agree that the geometry vocabulary is in conflict. Think of train tracks instead: Two trains which are on parallel tracks can run independently and simultaneously with little or no interaction. These trains run concurrently, in parallel.
The basic usage difficulty is that "concurrent" can mean "at the same time" (with the trains or code) or "at the same place" (with the geometric lines). For many practical purposes (trains, thread resources) these two notions are directly in conflict.
Natural language is supposed to be silly, ambiguous, and confusing. But we're programmers. We can take refuge in the clarity, simplicity, and elegance of our formal programming languages. Like perl.
From Wikipedia:
Concurrent computing is a form of computing in which programs are designed as collections of interacting computational processes that may be executed in parallel.
Basically, programs can be written as concurrent programs if they are made up of smaller interacting processes. Parallel programming is actually doing these processes at the same time.
So I suppose that concurrent programming is really a style that lends itself to processes being executed in parallel to improve performance.
No, concurrent is definitely different from parallel. Here is exactly how.
Concurrency refers to the sharing of resources in the same time frame. As an example, several processes may share the same CPU or share memory or an I/O device.
Now, by definition two processes are concurrent if and only if the second starts execution before the first has terminated (on the same CPU). If the two processes both run on the same (say, for now, single-core) CPU, the processes are concurrent but not parallel: in this case, parallelism is only virtual and refers to the OS doing timesharing. The OS seems to be executing several processes simultaneously, but if there is only one single-core CPU, only one instruction from only one process can be executing at any particular time. Since the human time scale is billions of times slower than that of modern computers, the OS can rapidly switch between processes to give the appearance of several processes executing at the same time.
If you instead run the two processes on two different CPUs, the processes are parallel: there is no sharing in the same time frame, because each process runs on its own CPU. The parallelism in this case is not virtual but physical. It is worth noting that running on different cores of the same multi-core CPU still cannot be classified as fully parallel, because the processes will share the same CPU caches and will even contend for them.