openMP: Running with all threads in parallel leads to out-of-memory-exceptions - c++

I want to shorten the runtime of an lengthy image processing algorithm, which is applied to multiple images by using parallel processing with openMP.
The algorithm works fine with single or limited number (=2) of threads.
But: The parallel processing with openMP requires lots of memory, leading to out-of-memory-exceptions, when running with the maximum number of possible threads.
To resolve the issue, I replaced the "throwing of exceptions" with a "waiting for free memory" in case of low memory, leading to many (<= all) threads just waiting for free memory...
Is there any solution/tool/approach to dynamically maintain the memory or start threads depending on available memory?

Try compiling your program 64-bit. 32-bit programs can only have up to 2^32 = about 4GB of memory. 64-bit programs can use significantly more (2^64 which is 18 exabytes). It's very easy to hit 4GB of memory these days.
Note that if you are using more RAM than you have available, your OS will have to page some memory to disk. This can hurt performance a lot. If you get to this point (where you are using a significant portion of RAM) and still have extra cores, you would have to go deeper into the algorithm to find a more granular section to parallelize.
If you for some reason can't switch to 64-bit, you can do multiprocessing (running multiple instances of a program) so each process will have up to 4GB. You will need to launch and coordinate the processes somehow. Depending on your needs, this could mean using simple command-line arguments or complicated inter-process communication (IPC). OpenMP doesn't do IPC, but Open MPI does. Open MPI is generally used for communication between many nodes on a network, but it can be set up to run concurrent instances on one machine.

Related

Performance of threads and processes in linux

I have two scenarios in Linux that I've been working for some time in the same machine. The machine has two xeon processors each with 8 cores and 16 threads.
I have one code in c++ that is parallelized with openmp. In this scenario, if I use all threads (32 in total according to the Linux kernel) do I have any penalties in terms of concurrence between the threads ? I mean, setting 32 threads is the optimal configuration for this scenario ?
I run a given number of processes (all single threaded) using the same binary. Basically I have a script that spawn the same binary with different input files. In this scenario, what is the best way to launch these processes and not exhaust the machine ? I think that if I run 32 processes at the same time I will harm the performance of the machine.
The optimal one will generally be something between 16 and 32 for CPU-bound tasks (hyperthreaded cores compete for the same resources); for memory-bound or even IO-bound tasks it can be even lower.
Still, in most cases using as many threads as cores can be a good starting point.
Why should it be harmful? In Linux, threads are just processes that happen to share the virtual address space (and most other OS resources). If you have enough RAM to keep them running without paging¹ and each process is single thread, 32 is as ok as per the thread case.
notice that the situation would be pretty much the same for an equivalent multithreaded program, as the program code is shared between the various instances of the application.

Am I disturbing other programs with OpenMP?

I'm using OpenMP for a loop like this:
#pragma omp parallel for
for (int out = 1; out <= matrix.rows; out++)
{
...
}
I'm doing a lot of computations on a machine with 64 CPUs. This works quite qell but my question is:
Am I disturbing other users on this machine? Usually they only run single thread programms. Will they still run on 100%? Obviously I will disturb other multithreads programms, but will I disturb single thread programs?
If yes, can I prevend this? I think a can set the maximum number of CPUs with omp_set_num_threads. I can set this to 60, but I don't think this is the best solution.
The ideal solution would disturb no other single thread programs but take as much CPUs as possible.
Every multitasking OS has something called a process scheduler. This is an OS component that decides where and when to run each process. Schedulers are usually quite stubborn in the decisions they make but those could often be influenced by various user-supplied policies and hints. The default configuration for almost any scheduler is to try and spread the load over all available CPUs, which often results in processes migrating from one CPU to another. Fortunately, any modern OS except "the most advanced desktop OS" (a.k.a. OS X) supports something called processor affinity. Every process has a set of processors on which it is allowed to execute - the so-called CPU affinity set of that process. By configuring disjoint affinity sets to various processes, those could be made to execute concurrently without stealing CPU time from each other. Explicit CPU affinity is supported on Linux, FreeBSD (with the ULE scheduler), Windows NT (this also includes all desktop versions since Windows XP), and possibly other OSes (but not OS X). Every OS then provides a set of kernel calls to manipulate the affinity and also an instrument to do that without writing a special program. On Linux this is done using the sched_setaffinity(2) system call and the taskset command line instrument. Affinity could also be controlled by creating a cpuset instance. On Windows one uses the SetProcessAffinityMask() and/or SetThreadAffinityMask() and affinities can be set in Task Manager from the context menu for a given process. Also one could specify the desired affinity mask as a parameter to the START shell command when starting new processes.
What this all has to do with OpenMP is that most OpenMP runtimes for the listed OSes support under one form or another ways to specify the desired CPU affinity for each OpenMP thread. The simplest control is the OMP_PROC_BIND environment variable. This is a simple switch - when set to TRUE, it instructs the OpenMP runtime to "bind" each thread, i.e. to give it an affinity set that includes a single CPU only. The actual placement of threads to CPUs is implementation dependent and each implementation provides its own way to control it. For example, the GNU OpenMP runtime (libgomp) reads the GOMP_CPU_AFFINITY environment variable, while the Intel OpenMP runtime (open-source since not long ago) reads the KMP_AFFINITY environment variable.
The rationale here is that you could limit your program's affinity in such a way as to only use a subset of all the available CPUs. The remaining processes would then get predominantly get scheduled to the rest of the CPUs, though this is only guaranteed if you manually set their affinity (which is only doable if you have root/Administrator access since otherwise you can modify the affinity only of processes that you own).
It is worth mentioning that it often (but not always) makes no sense to run with more threads than the number of CPUs in the affinity set. For example, if you limit your program to run on 60 CPUs, then using 64 threads would result in some CPUs being oversubscribed and in timesharing between the threads. This will make some threads run slower than the others. The default scheduling for most OpenMP runtimes is schedule(static) and therefore the total execution time of the parallel region is determined by the execution time of the slowest thread. If one thread timeshares with another one, then both threads will execute slower than those threads that do not timeshare and the whole parallel region would get delayed. Not only this reduces the parallel performance but it also results in wasted cycles since the faster threads would simply wait doing nothing (possibly busy looping at the implicit barrier at the end of the parallel region). The solution is to use dynamic scheduling, i.e.:
#pragma omp parallel for schedule(dynamic,chunk_size)
for (int out = 1; out <= matrix.rows; out++)
{
...
}
where chunk_size is the size of the iteration chunk that each thread gets. The whole iteration space is divided in chunks of chunk_size iterations and are given to the worker threads on a first-come-first-served basis. The chunk size is an important parameter. If it is too low (the default is 1), then there could be a huge overhead from the OpenMP runtime managing the dynamic scheduling. If it is too high, then there might not be enough work available for each thread. It makes no sense to have chunk size bigger than maxtrix.rows / #threads.
Dynamic scheduling allows your program to adapt to the available CPU resources, even if they are not uniform, e.g. if there are other processes running and timesharing with the current one. But it comes with a catch: big system like your 64-core one usually are ccNUMA (cache-coherent non-uniform memory access) systems, which means that each CPU has its own memory block and access to the memory block(s) of other CPU(s) is costly (e.g. takes more time and/or provides less bandwidth). Dynamic scheduling tends to destroy data locality since one could not be sure that a block of memory, which resides on one NUMA, won't get utilised by a thread running on another NUMA node. This is especially important when data sets are large and do not fit in the CPU caches. Therefore YMMV.
Put your process on low priority within the operating system. Use a many resources as you like. If someone else needs those resources the OS will make sure to provide them, because they are on a higher (i.e. normal) priority. If there are no other users you will get all resources.

Dual socket vs single socket memory model?

I am a bit confused about what memory looks like in a dual CPU machine from the perspective of a C/C++ program running on Linux.
Case 1 (understood)
With one quad-core HT CPU, 32GB RAM, I can, in theory, write a single process application, using up to 8 threads and up to 32GB RAM without going into swap or overloading the threading facilities - I am ignore the OS and other processes here for simplicity.
Case 2 (confusion)
What happens with a dual quad-core HT CPU with 64GB RAM set up?
Development-wise, do you need to write an application to run as two processes (8 threads, 32GB each) that communicate or can you write it as one process (16 threads, 64GB full memory)?
If the answer is the former, what are some efficient modern strategies to utilize the entire hardware? shm? IPC? Also, how do you direct Linux to use a different CPU for each process?
From the application's viewpoint, the number of physical CPUs (dies) doesn't matter. Only the number of virtual processors. These include all cores on all processors, and double, if hyperthreading is enabled on a core. Threads are scheduled on them in the same way. It doesn't matter if the cores are all on one die or spread across multiple dies.
In general, the best way to handle these things is to not. Don't worry about what's running on which core. Just spawn an appropriate number of threads for your application, (up to a theoretical maximum equal to the total number of cores in the system), and let the OS deal with the scheduling.
The memory is shared amongst all cores in the system, of course. But again, it's up the OS to handle allocation of physical memory. Very few applications really need to worry about how much memory they use, and divvying up that memory between threads. Let the OS handle that.
The memory model has ** nothing ** to do with number of cores per se, rather it has to do with the architecture employed on multi core computers. Most mainstream computers use symmetric multi processing model, wherein a single OS is controlling all the CPUs, and programs running on those CPUs have access to all the available memory. Each CPU does has private memory (cache), but the RAM is all shared. So if you have 64 bit machine it makes zilch difference whether you write 1 process, or two processes AS FAR AS memory usage implications are concerned. Programming wise you would be better to use a single process.
As other pointed out, you do need to worry about thread affinities and such, but that has more to do with efficient use of CPU resources, and little to do with RAM usage. There would be some implications of cache usage though.
Contrast with other memory model computers, like NUMA (Non-Uniform Memory Access), where each CPU has its own block of memory, and communicating across CPUs then requires some arbiter in between. On these computers you WOULD NEED to worry about where to place your threads, memory wise.

What does taskset in linux exactly do?

I have a program running on a 32 core system using Intel TBB.
The problem I have is when I set the program to use 32 threads, the performance doesn't gain enough compared to 16 threads (only 50% boost). However, when I use:
taskset 0xFFFFFFFF ./foo
which would lock the process to 32 cores, the performance is much better.
I have the two following questions:
Why? By Default, the OS would use all 32 cores for the 32 thread program anyway.
I'm assuming that even with taskset, the OS is allowed (would) to exchange the virtual threads and the physical threads, i.e. threads are not pinned. am I right?
Thanks.
The operating system may choose to use less cores for cache purposes. Imagine if the application uses the same set of memory then each write causes a cache invalidate. Forcing the lock is essentially you telling the OS the cache overhead for concurrency is not worth it, go ahead and use all the cores.
You must also remember there are other processes to run (like kthreads from the kernel, and background processes.) and migrating threads between cores is costly and may cause imbalances if your threads are not doing an even amount of work.
Also remember that the OS tries to evenly distribute work on the cores across ALL processes not just yours. This means that the load balancer may choose to not place your process on all 32 cores as there are other processes currently running and migration costs could be high or spreading your process evenly could cause load imbalance among the cpu cores. The OS strives for best system performance not necessarily best per application performance.

Multiple instances of program on multi-core machine

I am assuming a dual-core (2 cores per processors) machine with 2 processors for the questions that follow; so a total of 4 "cores". So some natural questions arose:
Suppose I wrote a simple serial program and built it in, say, Visual Studio.. and ran the same program twice, say, with distinct input data in each run. Would they be running on the same processor? Or distinct processors? How much RAM memory would be assigned to each? Would it be the RAM memory on 1 processor (2 cores) or the total RAM? I believe the two programs would run on distinct processors and should each have RAM memory of 1 processor (2 cores); but I am not 100% certain. Would the behavior be any different on Linux?
Now suppose my program was written using a distributed memory parallel interface such as MPI and that I ran it once with 2 processors in the np argument (say). Would the program use both processors (and in effect all 4 cores)? Is this the optimal value for the argument -np? In other words, if I did the same with -np 3 or -np 4; is it correct to assume there would be no added advantage? Again, I think so, but I am not 100% certain. I assume also that I could go higher than 4 (-np 5, -np 6, etc). In such cases, how do the processes compete for memory at values of np > 4? Would the performance get worse for np > 4. I think yes, and perhaps this partly depends on problem size, but again not 100% sure.
Next, suppose I ran two instances of my MPI-built parallel program, both with -np 2, each with, say, different input data. First off, is this possible? I assume it is and that they each run on both processors? How are the two programs synchronized and how do they individually compete for memory sequentially? This should atleast in part, be based on the order of launching the programs, presumably?
Lastly, suppose my program was written using a shared memory parallel interface such as OpenMP and that I ran it once. How many "threads" can I run it on to make full use of shared memory parallelism - is it 2 or 4? (since I have 2 processors with 2 cores each). My guess is it is 4; since all 4 cores are part of the a single shared memory unit? Is that correct? If the answer is 4; does it make sense to run on greater than 4 threads? I am not sure this even works (unlike MPI, where I believe we can do -np 5, -np 6 and so on).
Finally, suppose I run 2 instances of the shared memory parallel program, each with, say, different input data. I assume this is possible and that the individual processes would somehow compete for memory, presumably in the order the programs were launched?
Which processor they run on is entirely up to the OS and depends on many factors, including whatever else is happening on the same machine. The common case, though, is that they will tend to sit on one core each, occasionally swapping to different cores ("occasionally" may mean several times a second or even more frequently).
Çores don't have their own RAM on normal PC hardware, and the processes will be given however much RAM they ask for.
For MPI processes, yes, your parallelism should match the core count (assuming a CPU-heavy workload). If two MPI processes run with -np 2, they will simply consume all four cores. Increase anything and they'll start to contend. As explained above, RAM has nothing to do with any of this, though cache will suffer in the presence of contention.
This "question" is way too long, so I'm going to stop now.
#Marcelo is absolutely right and I'd like to just expand on his answer a little bit.
The OS will determine where and when the threads the comprise the application execution depending on what else is going on in the system and the available resources. Each application will run in it's own process and that process can have hundereds or thousands of threads. The OS (Windows, Linux, Mac whatever) will switch the execution context of the processing cores to ensure that all applications and services get a slice of the pie.
As for I/O access to such things as RAM that is physically controlled by the NorthBridge Controller that sits on your motherboard. Each process (not processor!) will have an allocated amount of RAM that it can deal with that can expand or contract over the lifetime of the application... this of course is limited to the amount of resources available on the system, and also worth noting the OS will take care of swapping RAM requests beyond it's physically availability to disk (i.e. Virtual RAM).
On the other hand though you will need to coordinate access to memory within your application through the use of critical sections and other thread synchronising mechanisms.
OpenMP is a library that helps you write multithreaded parellel applications and makes the syntax of keeping threads in sync easier.... I would comment more, but it's been quite a while since I've used it and I'm sure someone could give a better explaination.
I see you are using windows, so I will summarize by saying that you can set process affinities (which core or cores a process can run on) in the task manager. There's also a winapi call but the name escapes me
a) for a single threaded program, they will not launch on the same cpu (assuming its cpu bound). You can guarantee it by changing the affinity. in linux there's a call sched_setaffinity and a userspace program taskset
b) depends on the MPI library; the machinery is library-specific.
c) it depends on the specific application and data pattern. For small data accesses but lots of messaging passing, you may actually find limiting to 1 CPU to be the most efficient pattern.