OpenMP and memory limitation - c++

I have a C++ application that uses the OpenMP parallel construct.
The method called inside the for loop uses a lot of memory. It allocates memory at the start and releases it at the end.
If the system has enough memory, it works well, but if there is not enough memory, the operation fails.
The target system may only have enough memory for 2 threads to run in parallel, or maybe 3.
Is there any way to configure OpenMP so that it knows how many threads it should use based on the available memory?
If OpenMP cannot do this, is there any way that I can do it myself?

OpenMP is very dumb when it comes to monitoring memory usage, so you would have to implement this yourself. A good strategy would be to obtain the amount of usable memory and then divide it by the memory requirement per thread in order to get the upper limit on the number of threads that can process data concurrently. Once you know that number, you can force the parallel region to run with that many threads using the num_threads clause (which requires a positive value, so make sure the quotient is at least 1):
int max_threads = mem_size / mem_per_thread;
#pragma omp parallel for num_threads(max_threads)
for (...)
{
}
Now the hard question is how to obtain the amount of usable memory, especially given that virtually all modern operating systems implement virtual memory. One solution would be to leave that to the end user, e.g. provide a parameter in your program's configuration that the user can set to a value they deem reasonable. Another strategy could be to set the value to a given percentage of the physical memory size.
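The division described above can be wrapped in a small helper that also guards against the two edge cases (a quotient of zero and a quotient above the hardware limit). This is only a sketch: threads_for_budget and its parameter names are made up for illustration, and how usable_bytes is obtained (user setting or a fraction of physical RAM) is left open, as in the answer.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical helper: how many threads fit into a given memory budget.
//   usable_bytes     - memory the program may use (e.g. a user-configured value)
//   bytes_per_thread - peak memory needed by one concurrently running iteration
//   hw_threads       - upper bound, e.g. the result of omp_get_max_threads()
inline int threads_for_budget(std::size_t usable_bytes,
                              std::size_t bytes_per_thread,
                              int hw_threads)
{
    assert(bytes_per_thread > 0 && hw_threads > 0);
    const std::size_t by_memory = usable_bytes / bytes_per_thread;
    // Run at least one thread and never more than the hardware offers.
    const std::size_t clamped =
        std::min<std::size_t>(std::max<std::size_t>(by_memory, 1),
                              static_cast<std::size_t>(hw_threads));
    return static_cast<int>(clamped);
}
```

The result can then be passed to the clause, e.g. num_threads(threads_for_budget(budget, per_thread, omp_get_max_threads())). If even a single thread does not fit into the budget, the helper still returns 1, so the caller has to decide separately whether to run at all.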

Related

cannot allocate memory fast enough?

Assume you are tasked to address a performance bottleneck in an application. Via profiling we discover the bottleneck is related to memory allocation. We discover that the application can only perform N memory allocations per second, no matter how many threads we have allocating memory. Why would we be seeing this behavior and how might we increase the rate at which the application can allocate memory. (Assume that we cannot change the size of the memory blocks that we are allocating. Assume that we cannot reduce the use of dynamically allocated memory.)
Okay, a few solutions exist - however almost all of them seem to be excluded via some constraint or another.
1. Have more threads allocate memory
We discover that the application can only perform N memory allocations per second, no matter how many threads we have allocating memory.
From this, we can cross-off any ideas of adding more threads (since "no matter how many threads"...).
2. Allocate more memory at a time
Assume that we cannot change the size of the memory blocks that we are allocating.
Fairly obviously, we have to allocate the same block size.
3. Use (some) static memory
Assume that we cannot reduce the use of dynamically allocated memory.
This one I found most interesting. It reminded me of a story I heard about a FORTRAN programmer (before Fortran had dynamic memory allocation) who just used a HUGE static array allocated on the stack as a private heap.
Unfortunately, this constraint prevents us from using such a trick. However, it does give a glimpse of one aspect of a (the) solution.
My Solution
At the start of execution (either of the program, or on a per-thread basis) make several* memory allocation system calls. Then use the memory from these later in the program (along with the existing dynamic memory allocations).
* Note: The 'several' would probably be an exact number, determined from your profiling, which the question mentions in the beginning.
TL;DR
The trick is to modify the timing of the memory allocations.
Looks like a challenging problem, though without details you can only make some guesses. (Which is most likely the idea of this question.)
The limitation here is the number of allocations, not the size of the allocation.
If we can assume that you are in control of where the allocations occur, you can allocate the memory for multiple instances at once. Please consider the code below as pseudo code, as it's for illustration purposes only.
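static const size_t NR_COMBINED_ALLOCATIONS = 16;
// One allocation call provides raw storage for 16 objects.
auto *memoryBuffer = static_cast<MyClass *>(malloc(sizeof(MyClass) * NR_COMBINED_ALLOCATIONS));
size_t nextIndex = 0;
// Some looping code
auto *myNewClass = new (&memoryBuffer[nextIndex++]) MyClass;  // placement new (needs <new>), no allocation
// Some code
myNewClass->~MyClass();  // destroy explicitly; the storage stays in the buffer
// Once all objects in the buffer have been destroyed
free(memoryBuffer);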
Your code will most likely become a lot more complex, though you will most likely tackle this bottleneck. In case you have to return this new class, you need even more code just to do the memory management.
Given this information, you can write your own allocators for the STL containers, override the 'new' and 'delete' operators ...
If that is not enough, try challenging the limitations. Why can you only do a fixed number of allocations? Is this because of exclusive locking? If so, can we improve this? Why do you need that many allocations? Would changing the algorithm that is being used fix this issue? ...
... the application can only perform N memory allocations per second,
no matter how many threads we have allocating memory. Why would we be
seeing this behavior and how might we increase the rate at which the
application can allocate memory.
IMHO, the most likely cause is that the allocations are coming from a common system pool.
Because they share a pool, each thread has to gain access through some critical-section blocking mechanism (perhaps a semaphore).
The more threads compete for dynamic memory (i.e. use new), the more critical-section blocking occurs.
The context switching between tasks is where the time is wasted.
How to increase the rate?
option 1 - serialize the usage ... and this means, of course, that you cannot simply try to use a semaphore at another level. For one system I worked on, high dynamic memory utilization happened during system start-up. In that case, it was easiest to change the start-up such that thread n+1 (of this collection) only started after thread n had completed its initialization and fallen into its wait-for-input loop. With only 1 thread doing its start-up thing at a time (and very few other dynamic memory users running yet), no critical-section blockage occurred. 4 simultaneous start-ups would take 30 seconds. 4 serialized start-ups finished in 5 seconds.
option 2 - provide a pool of RAM and a private new/delete for each particular thread. If only one thread accesses a pool at a time, a critical section or semaphore is not needed. In an embedded system, the challenge is to allocate a reasonable amount of private pool for each thread without too much waste. On a desktop with multiple gigabytes of RAM, this is probably less of a problem.
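A minimal sketch of what option 2 could look like, assuming a simple bump allocator is acceptable (memory is handed out sequentially and released all at once). ThreadArena, local_arena and the 1 MiB capacity are illustrative names and values, not something taken from the answer.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-thread bump allocator. Each thread owns its own instance,
// so allocate() needs no lock. Individual blocks cannot be freed; reset()
// releases everything at once.
class ThreadArena
{
public:
    explicit ThreadArena(std::size_t capacity) : buffer_(capacity), used_(0) {}

    // Returns `size` bytes whose offset from the buffer start is a multiple of
    // `align` (a power of two), or nullptr when the arena is exhausted.
    // Assumes `align` does not exceed the alignment operator new provides.
    void *allocate(std::size_t size, std::size_t align = alignof(std::max_align_t))
    {
        assert(align != 0 && (align & (align - 1)) == 0);
        const std::size_t start = (used_ + align - 1) & ~(align - 1);
        if (start > buffer_.size() || size > buffer_.size() - start)
            return nullptr;
        used_ = start + size;
        return buffer_.data() + start;
    }

    void reset() { used_ = 0; }
    std::size_t used() const { return used_; }

private:
    std::vector<unsigned char> buffer_;  // the private pool of this thread
    std::size_t used_;                   // bytes handed out so far
};

// One arena per thread; the capacity is an arbitrary example value.
inline ThreadArena &local_arena()
{
    thread_local ThreadArena arena(1 << 20);
    return arena;
}
```

A thread would call local_arena().allocate(n) instead of new or malloc and fall back to the shared heap when nullptr is returned. The only shared-heap allocation left is the one that creates the pool itself, once per thread.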
I believe you could use a separate thread that is responsible for memory allocation. This thread would have a queue containing a map of thread identifiers and the memory allocations they need. Threads would not allocate memory directly, but rather send an allocation request to the queue and go into a wait state. The allocating thread, in turn, would try to process each requested memory allocation from the queue and wake up the corresponding sleeping thread. When the thread responsible for memory handling cannot process an allocation due to the limitation, it should wait until memory can be allocated again.
One could build another layer into the solution, as @Tersosauros's solution suggested, to slightly optimize speed, but it should be based on something like the idea above nonetheless.

Am I disturbing other programs with OpenMP?

I'm using OpenMP for a loop like this:
#pragma omp parallel for
for (int out = 1; out <= matrix.rows; out++)
{
...
}
I'm doing a lot of computations on a machine with 64 CPUs. This works quite well, but my question is:
Am I disturbing other users on this machine? Usually they only run single-threaded programs. Will those still run at 100%? Obviously I will disturb other multithreaded programs, but will I disturb single-threaded programs?
If yes, can I prevent this? I think I can set the maximum number of CPUs with omp_set_num_threads. I could set this to 60, but I don't think this is the best solution.
The ideal solution would disturb no other single-threaded programs but take as many CPUs as possible.
Every multitasking OS has something called a process scheduler. This is the OS component that decides where and when to run each process. Schedulers are usually quite stubborn in the decisions they make, but those decisions can often be influenced by various user-supplied policies and hints. The default configuration for almost any scheduler is to try to spread the load over all available CPUs, which often results in processes migrating from one CPU to another. Fortunately, any modern OS except "the most advanced desktop OS" (a.k.a. OS X) supports something called processor affinity. Every process has a set of processors on which it is allowed to execute, the so-called CPU affinity set of that process. By assigning disjoint affinity sets to different processes, they can be made to execute concurrently without stealing CPU time from each other.
Explicit CPU affinity is supported on Linux, FreeBSD (with the ULE scheduler), Windows NT (this also includes all desktop versions since Windows XP), and possibly other OSes (but not OS X). Every OS provides a set of kernel calls to manipulate the affinity and also a tool to do that without writing a special program. On Linux this is done using the sched_setaffinity(2) system call and the taskset command-line tool. Affinity can also be controlled by creating a cpuset instance. On Windows one uses SetProcessAffinityMask() and/or SetThreadAffinityMask(), and affinities can be set in Task Manager from the context menu of a given process. One can also specify the desired affinity mask as a parameter to the START shell command when starting new processes.
What all this has to do with OpenMP is that most OpenMP runtimes for the listed OSes support, in one form or another, ways to specify the desired CPU affinity for each OpenMP thread. The simplest control is the OMP_PROC_BIND environment variable. This is a simple switch: when set to TRUE, it instructs the OpenMP runtime to "bind" each thread, i.e. to give it an affinity set that includes a single CPU only. The actual placement of threads on CPUs is implementation dependent, and each implementation provides its own way to control it. For example, the GNU OpenMP runtime (libgomp) reads the GOMP_CPU_AFFINITY environment variable, while the Intel OpenMP runtime (now open source) reads the KMP_AFFINITY environment variable.
The rationale here is that you can limit your program's affinity in such a way that it only uses a subset of all the available CPUs. The remaining processes would then predominantly get scheduled on the rest of the CPUs, though this is only guaranteed if you manually set their affinity (which is only doable if you have root/Administrator access, since otherwise you can modify the affinity only of processes that you own).
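For the Linux case, restricting your own program to a subset of CPUs can be sketched with the sched_getaffinity/sched_setaffinity calls mentioned above. The helper name limit_to_first_cpus is made up, and keeping the first allowed CPUs is an arbitrary policy chosen for illustration; on Linux, threads created afterwards normally inherit the mask of the thread that creates them, so this would be called before the first parallel region.

```cpp
#include <cassert>
#include <sched.h>  // Linux/glibc: sched_getaffinity, sched_setaffinity, CPU_* macros

// Hypothetical helper (Linux only): shrink the calling thread's affinity set
// to the first `max_cpus` CPUs it is currently allowed to run on.
// Returns false if the affinity could not be read or changed.
inline bool limit_to_first_cpus(int max_cpus)
{
    assert(max_cpus > 0);
    cpu_set_t allowed;
    if (sched_getaffinity(0, sizeof(allowed), &allowed) != 0)  // 0 = calling thread
        return false;

    cpu_set_t wanted;
    CPU_ZERO(&wanted);
    int kept = 0;
    for (int cpu = 0; cpu < CPU_SETSIZE && kept < max_cpus; ++cpu) {
        if (CPU_ISSET(cpu, &allowed)) {  // only keep CPUs we were allowed to use
            CPU_SET(cpu, &wanted);
            ++kept;
        }
    }
    return sched_setaffinity(0, sizeof(wanted), &wanted) == 0;
}
```

With 64 CPUs, limit_to_first_cpus(60) would leave four CPUs that this program never touches. The same effect can be had without code changes by starting the program through taskset.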
It is worth mentioning that it often (but not always) makes no sense to run with more threads than the number of CPUs in the affinity set. For example, if you limit your program to run on 60 CPUs, then using 64 threads would result in some CPUs being oversubscribed and in timesharing between the threads. This will make some threads run slower than the others. The default scheduling for most OpenMP runtimes is schedule(static), and therefore the total execution time of the parallel region is determined by the execution time of the slowest thread. If one thread timeshares with another one, then both threads will execute slower than the threads that do not timeshare, and the whole parallel region gets delayed. Not only does this reduce the parallel performance, but it also results in wasted cycles, since the faster threads simply wait doing nothing (possibly busy looping at the implicit barrier at the end of the parallel region). The solution is to use dynamic scheduling, i.e.:
#pragma omp parallel for schedule(dynamic,chunk_size)
for (int out = 1; out <= matrix.rows; out++)
{
...
}
where chunk_size is the size of the iteration chunk that each thread gets. The whole iteration space is divided into chunks of chunk_size iterations, which are given to the worker threads on a first-come-first-served basis. The chunk size is an important parameter. If it is too low (the default is 1), then there could be a huge overhead from the OpenMP runtime managing the dynamic scheduling. If it is too high, then there might not be enough work available for each thread. It makes no sense to have a chunk size bigger than matrix.rows / #threads.
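The upper bound on the chunk size can be enforced in code. The sketch below uses two made-up helpers: capped_chunk applies the rows / #threads limit, and sum_rows stands in for the real per-row work with a trivial reduction. When compiled without OpenMP support, the pragma is ignored and the loop runs serially with the same result.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical helper: cap the requested chunk size at rows / threads, since
// larger chunks would leave some threads without any work.
inline int capped_chunk(int requested, int rows, int threads)
{
    assert(requested > 0 && rows > 0 && threads > 0);
    return std::max(1, std::min(requested, rows / threads));
}

// Stand-in for the real loop body: adds up one value per row, handing out
// `chunk` iterations at a time on a first-come-first-served basis.
inline long long sum_rows(const std::vector<int> &row_cost, int chunk)
{
    assert(chunk > 0);
    long long total = 0;
    const int rows = static_cast<int>(row_cost.size());
    #pragma omp parallel for schedule(dynamic, chunk) reduction(+ : total)
    for (int out = 0; out < rows; ++out)
        total += row_cost[out];
    return total;
}
```

For example, capped_chunk(100, 640, 64) yields 10, so each of the 64 threads can still pick up one chunk of the 640 rows.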
Dynamic scheduling allows your program to adapt to the available CPU resources, even if they are not uniform, e.g. if there are other processes running and timesharing with the current one. But it comes with a catch: big systems like your 64-core one are usually ccNUMA (cache-coherent non-uniform memory access) systems, which means that each CPU has its own memory block and access to the memory block(s) of other CPU(s) is costly (e.g. takes more time and/or provides less bandwidth). Dynamic scheduling tends to destroy data locality, since one cannot be sure that a block of memory that resides on one NUMA node won't get used by a thread running on another NUMA node. This is especially important when data sets are large and do not fit in the CPU caches. Therefore YMMV.
Put your process on low priority within the operating system. Use as many resources as you like. If someone else needs those resources, the OS will make sure to provide them, because their processes run at a higher (i.e. normal) priority. If there are no other users, you will get all resources.
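On POSIX systems the priority can be lowered from inside the program with getpriority/setpriority (the nice command does the same from the shell). The following is a sketch only: lower_own_priority is a made-up name, and the cap of 19 assumes the usual Linux nice range.

```cpp
#include <algorithm>
#include <cassert>
#include <cerrno>
#include <sys/resource.h>  // POSIX: getpriority, setpriority

// Hypothetical helper (POSIX): raise the nice value of the caller by `delta`,
// i.e. lower its scheduling priority. Lowering priority needs no privileges.
// Returns false if the priority could not be read or changed.
inline bool lower_own_priority(int delta)
{
    assert(delta >= 0);
    errno = 0;  // getpriority may legitimately return -1, so errno is the error signal
    const int current = getpriority(PRIO_PROCESS, 0);  // 0 = the caller itself
    if (current == -1 && errno != 0)
        return false;
    const int target = std::min(current + delta, 19);  // 19: assumed lowest priority
    return setpriority(PRIO_PROCESS, 0, target) == 0;
}
```

Calling lower_own_priority(10) at program start, before any parallel work begins, would be one way to apply this advice.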

Thread IDs with PPL and Parallel Memory Allocation

I have a question about the Microsoft PPL library, and parallel programming in general. I am using FFTW to perform a large set (100,000) of 64 x 64 x 64 FFTs and inverse FFTs. In my current implementation, I use a parallel for loop and allocate the storage arrays within the loop. I have noticed that my CPU usage only tops out at about 60-70% in these cases. (Note that this is still better utilization than the built-in threaded FFTs provided by FFTW, which I have tested.) Since I am using fftw_malloc, is it possible that excessive locking is occurring, which prevents full usage?
In light of this, is it advisable to preallocate the storage arrays for each thread before the main processing loop, so that no locks are required within the loop itself? And if so, how is this possible with the MSFT PPL library? I have used OpenMP before, and in that case it is simple enough to get a thread ID using the supplied functions. However, I have not seen a similar function in the PPL documentation.
I am just answering this because nobody has posted anything yet.
Mutex(e)s can wreak havoc on performance if heavy locking is required. In addition, if a lot of memory (re)allocation is needed, that can also decrease performance and limit it to your memory bandwidth. Like you said, a preallocation that the threads later operate on can be useful. However, this requires that you have a fixed thread count and that you spread your workload evenly across all threads.
Concerning the PPL thread_id functions, I can only speak about Intel TBB, which however should be pretty similar to PPL. TBB (and I suppose PPL as well) does not speak of threads directly; instead it talks about tasks. The aim of TBB was to abstract these underlying details away from the user, thus it does not provide a thread_id function.
Using PPL, I have had good performance with an application that does a lot of allocations by using a Concurrency::combinable to hold a structure containing the memory allocated per thread.
In fact you don't have to pre-allocate: you can check the value of your combinable variable with ->local() and allocate it if it is null. The next time this thread is called, it will already be allocated.
Of course you have to free the memory when all tasks are done, which can be done with something like:
combine_each([](MyPtr* p){ delete p; });

Can I allocate memory faster by using multiple threads?

If I make a loop that reserves integer arrays of 1024 elements, int[1024], and I want it to allocate 10000 arrays, can I make it faster by running the memory allocations from multiple threads?
I want them to be on the heap.
Let's assume that I have a multi-core processor for the job.
I already tried this, but it decreased the performance. I'm just wondering, did I just write bad code or is there something that I didn't know about memory allocation?
Does the answer depend on the OS? Please tell me how it works on different platforms if so.
Edit:
The integer array allocation loop was just a simplified example. Don't bother telling me how I can improve that.
It depends on many things, but primarily:
the OS
the implementation of malloc you are using
The OS is responsible for allocating the "virtual memory" that your process has access to and builds a translation table that maps the virtual memory back to actual memory addresses.
Now, the default implementation of malloc is generally conservative, and will simply have a giant lock around all this. This means that requests are processed serially, and the only thing that allocating from multiple threads instead of one does is slow the whole thing down.
There are more clever allocation schemes, generally based upon pools, and they can be found in other malloc implementations: tcmalloc (from Google) and jemalloc (used by Facebook) are two such implementations designed for high-performance in multi-threaded applications.
There is no silver bullet though, and at one point the OS must perform the virtual <=> real translation which requires some form of locking.
Your best bet is to allocate by arenas:
Allocate big chunks (arenas) at once
Split them up in arrays of the appropriate size
There is no need to parallelize the arena allocation, and you'll be better off asking for the biggest arenas you can (do bear in mind that a request for too large an amount may fail); then you can parallelize the split.
tcmalloc and jemalloc may help a bit, however they are not designed for big allocations (which is unusual) and I do not know if it is possible to configure the size of the arenas they request.
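The arena approach for the arrays in the question might look like the sketch below. ArrayArena is an invented name, and a std::vector is used as the single big chunk so that the memory is released automatically; apart from the pointer table, there is exactly one allocation no matter how many arrays are carved out. Without OpenMP support the pragma is ignored and the split runs serially.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch: one big allocation split into equally sized arrays.
class ArrayArena
{
public:
    ArrayArena(std::size_t count, std::size_t length)
        : storage_(count * length), arrays_(count)
    {
        // The split is pure pointer arithmetic on disjoint slots, so the
        // iterations are independent and may run in parallel.
        const long long n = static_cast<long long>(count);
        #pragma omp parallel for
        for (long long i = 0; i < n; ++i)
            arrays_[static_cast<std::size_t>(i)] =
                storage_.data() + static_cast<std::size_t>(i) * length;
    }

    // Copies would leave the pointer table aimed at the original storage.
    ArrayArena(const ArrayArena &) = delete;
    ArrayArena &operator=(const ArrayArena &) = delete;

    int *operator[](std::size_t i) { return arrays_[i]; }
    std::size_t size() const { return arrays_.size(); }

private:
    std::vector<int> storage_;   // the single big, zero-initialized chunk
    std::vector<int *> arrays_;  // arrays_[i] points at the i-th slice
};
```

ArrayArena arena(10000, 1024) then gives arena[0] through arena[9999], each usable as an int[1024]. The small arrays cannot be freed individually; they all go away together with the arena.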
The answer depends on the memory allocation routines, which are a combination of a C++ library layer, operator new, probably wrapped around libc's malloc(), which in turn occasionally calls an OS function such as sbrk(). The implementation and performance characteristics of all of these are unspecified, and may vary from compiler version to version, with compiler flags, different OS versions, different OSes etc. If profiling shows it's slower, then that's the bottom line. You can try varying the number of threads, but what's probably happening is that the threads are all trying to obtain the same lock in order to modify the heap... the overheads involved with saying "ok, thread X gets the go ahead next" and "thread X here, I'm done" are simply wasting time. Another C++ environment might end up using atomic operations to avoid locking, which might or might not prove faster... there is no general rule.
If you want to complete faster, consider allocating one array of 10000*1024 ints, then using different parts of it (e.g. [0]..[1023], [1024]..[2047]...).
I think that perhaps you need to adjust your expectation from multi-threading.
The main advantage of multi-threading is that you can do tasks asynchronously, i.e. in parallel. In your case, when your main thread needs more memory, it does not matter whether it is allocated by another thread: you still need to stop and wait for the allocation to complete, so there is no parallelism here. In addition, there is the overhead of one thread signaling when it is done and the other waiting for completion, which can only degrade the performance. Also, if you start a thread each time you need an allocation, this is a huge overhead. If not, you need some mechanism to pass the allocation request and response between threads, a kind of task queue, which again is overhead without gain.
Another approach could be that the allocating thread runs ahead and pre-allocates the memory that you will need. This can give you a real gain, but if you are doing pre-allocation, you might as well do it in the main thread, which will be simpler. E.g. allocate 10M in one shot (or 10 times 1M, or as much contiguous memory as you can have) and have an array of 10,000 pointers pointing into it at offsets of 1024, representing your arrays. If you don't need to deallocate them independently of one another, this seems to be much simpler and could be even more efficient than using multi-threading.
As for glibc, it has arenas, with one lock per arena.
You may also consider tcmalloc by Google (the name stands for Thread-Caching malloc), which shows a 30% performance boost for threaded applications. We use it in our project. In debug mode it can even detect some incorrect memory usage (e.g. a new/free mismatch).
As far as I know, every OS has an implicit mutex lock inside its dynamic allocation calls (malloc...). If you think about it for a moment, not locking this action could lead to terrible problems.
You could use the multithreading API Threading Building Blocks, http://threadingbuildingblocks.org/,
which has a multithreading-friendly scalable allocator.
But I think a better idea would be to allocate the whole memory once (which should work quite fast) and split it up on your own. I think the TBB allocator does something similar.
Do something like
new int[1024*10000] and then assign the parts of 1024 ints to your pointer array or whatever you use.
Do you understand?
Because the heap is shared per-process the heap will be locked for each allocation, so it can only be accessed serially by each thread. This could explain the decrease of performance when you do alloc from multiple threads like you are doing.
If the arrays belong together and will only be freed as a whole, you can just allocate an array of 10000*1024 ints, and then make your individual arrays point into it. Just remember that you cannot delete the small arrays, only the whole.
int *all_arrays = new int[1024 * 10000];
int *small_array123 = all_arrays + 1024 * 123;
Like this, you have small arrays when you replace the 123 with a number between 0 and 9999.
The answer depends on the operating system and runtime used, but in most cases, you cannot.
Generally, you will have two versions of the runtime: a multi-threaded version and a single-threaded version.
The single-threaded version is not thread-safe. Allocations made by two threads at the same time can blow your application up.
The multi-threaded version is thread-safe. However, as far as allocations go on most common implementations, this just means that calls to malloc are wrapped in a mutex. Only one thread can ever be in the malloc function at any given time, so attempting to speed up allocations with multiple threads will just result in a lock convoy.
It may be possible that there are operating systems that can safely handle parallel allocations within the same process, using minimal locking, which would allow you to decrease time spent allocating. Unfortunately, I don't know of any.

How much memory does a thread consume when first created?

I understand that creating too many threads in an application isn't what you might call being a "good neighbour" to other running processes, since CPU and memory resources are consumed even if these threads are in an efficient sleeping state.
What I'm interested in is this: how much memory (on the Win32 platform) is consumed by a sleeping thread?
Theoretically, I'd assume somewhere in the region of 1 MB (since this is the default stack size), but I'm pretty sure it's less than this, though I'm not sure why.
Any help on this will be appreciated.
(The reason I'm asking is that I'm considering introducing a thread-pool, and I'd like to understand how much memory I can save by creating a pool of 5 threads, compared to 20 manually created threads)
I have a server application which is heavy in thread usage. It uses a configurable thread pool which is set up by the customer, and in at least one site it has 1000+ threads, yet when started up it uses only 50 MB. The reason is that Windows reserves 1 MB for the stack (it reserves the address space), but that space is not necessarily allocated in physical memory, only a smaller part of it. If the stack grows beyond that part, a page fault is generated and more physical memory is allocated. I don't know what the initial allocation is, but I would assume it's equal to the allocation granularity of the system (usually 64 KB). Of course, the thread would also use a little more memory for other things when created (TLS, TSS, etc.), but my guess for the total would be about 200 KB. And bear in mind that any memory that is not frequently used would be paged out by the virtual memory manager.
Adding to Fabio's comments:
Memory is your second concern, not your first. The purpose of a threadpool is usually to constrain the context switching overhead between threads that want to run concurrently, ideally to the number of CPU cores available.
A context switch is very expensive, often quoted at a few thousand to 10,000+ CPU cycles.
A little test on WinXP (32 bit) clocks in at about 15k private bytes per thread (999 threads created). This is the initial committed stack size, plus any other data managed by the OS.
If you're using Vista or Win2k8 just use the native Win32 threadpool API. Let it figure out the sizing. I'd also consider partitioning types of workloads e.g. CPU intensive vs. Disk I/O into different pools.
MSDN Threadpool API docs
http://msdn.microsoft.com/en-us/library/ms686766(VS.85).aspx
I think you'd have a hard time detecting any impact of making this kind of a change to working code - 20 threads down to 5. And then add on the added complexity (and overhead) of managing the thread pool. Maybe worth considering on an embedded system, but Win32?
And you can set the stack size to whatever you want.
This depends highly on the system:
But usually, each process is independent. Usually the system scheduler makes sure that each process gets equal access to the available processors. Thus a multi-threaded application's time is multiplexed between its available threads.
Memory allocated to a thread will affect the memory available to its process but not the memory available to other processes. A good OS will page out unused stack space so it is not in physical memory. Though if your threads allocate enough memory while live, you could cause thrashing as each process's memory is paged to/from a secondary device.
I doubt a sleeping thread has any impact (or only very little) on the system.
It is not using any CPU
Any memory it is using can be paged out to a secondary device.
I guess this can be measured quite easily.
Get the amount of resources used by the system before creating a thread
Create a thread with default system values (default heap size and others)
Get the amount of resources after creating a thread and make the difference (with step 1).
Note that some threads need to be specified different values than the default ones.
You can try and find an average memory use by creating various number of threads (step 2).
The memory allocated by the OS when creating a thread consists of threads local data: TCB TLS ...
From wikipedia: "Threads do not own resources except for a stack, a copy of the registers including the program counter, and thread-local storage (if any)."