Why does a C++ OpenMP program execute longer? - c++

I have a problem understanding how this is possible.
I have a long text file (ten thousand lines) that I read into
a variable text as a string. I'd like to split it into 200 parts.
I've written this code using OpenMP directives:
std::string str[200];
omp_set_num_threads(200);
#pragma omp parallel
{
    #pragma omp for
    for (int i = 0; i < 200; i++)
    {
        str[i] = text.substr(i * (text.length() / 200), text.length() / 200);
    }
}
and its execution time is 231059 us.
If I write it sequentially:
for (int i = 0; i < 200; i++)
{
    str[i] = text.substr(i * (text.length() / 200), text.length() / 200);
}
the execution time is 215902 us.
Am I using OpenMP wrong, or what is happening here?

substr causes a memory allocation and a memcpy, and not much else. So instead of 1 thread asking the OS for some RAM, you now have N threads asking the OS for some RAM at the same time. This isn't a great design.
Splitting a workload across a thread group makes a lot of sense when the workload is CPU intensive. It makes no sense at all when all of those threads are competing for the same shared resource (e.g. the RAM). One thread will simply block all the others until each allocation has completed.
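If the goal is simply to hand out 200 pieces of the text, one way to sidestep the allocations entirely is to store views into the original string instead of copies. A minimal sketch, assuming C++17 std::string_view is available (the function name splitIntoViews is just for illustration):
#include <array>
#include <string>
#include <string_view>

// Store 200 non-owning views into the original buffer instead of 200 copies,
// so the loop performs no heap allocation at all (and needs no threads).
std::array<std::string_view, 200> splitIntoViews(const std::string& text)
{
    std::array<std::string_view, 200> parts;
    const std::size_t chunk = text.length() / 200;

    for (int i = 0; i < 200; i++)
        parts[i] = std::string_view(text).substr(i * chunk, chunk);

    return parts;
}
The views stay valid only as long as text is alive, but for read-only processing of the parts that is usually all that is needed.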

Related

What causes increasing memory consumption in OpenMP-based simulation?

The problem
I am having a big struggle with memory consumption in my Monte Carlo particle simulation, where I am using OpenMP for parallelization. Without going into the details of the simulation method: one parallel part is "particle moves" using some number of threads, and the other is "scaling moves" using some, possibly different, number of threads. These two parallel parts are run alternately, separated by some serial code, and each takes milliseconds to run.
I have an 8-core, 16-thread machine running Linux Ubuntu 18.04 LTS and I'm using gcc and the GNU OpenMP implementation. Now:
using 8 threads for "particle moves" and 8 threads for "scaling moves" yields a stable 8-9 MB memory usage
using 8 threads for "particle moves" and 16 threads for "scaling moves" causes increasing memory consumption, from those 8 MB to tens of GB for a long simulation, resulting in the end in an OOM kill
using 16 threads and 16 threads is OK
using 16 threads and 8 threads causes increasing consumption
So something is wrong if the numbers of threads for those two types of moves don't match.
Unfortunately, I was not able to reproduce the issue in a minimal example and I can only give a summary of the OpenMP code. A link to a minimal example is at the bottom.
In the simulation I have N particles with some positions. "Particle moves" are organized on a grid; I am using collapse(3) to distribute the threads. The code looks more or less like this:
// Each thread has its own cell in a 2 x 2 x 2 grid
#pragma omp parallel for collapse(3) num_threads(8 or 16)
for (std::size_t i = 0; i < 2; i++) {
    for (std::size_t j = 0; j < 2; j++) {
        for (std::size_t k = 0; k < 2; k++) {
            std::array<std::size_t, 3> gridCoords = {i, j, k};
            // This does something for all particles in the {i, j, k} grid cell
            doIndependentParticleMovesInAGridCellGivenByCoords(gridCoords);
        }
    }
}
(Notice that only 8 threads get work in both cases - 8 and 16 - but using those additional, jobless 8 threads magically fixes the problem when 16 scaling threads are used.)
In "volume moves" I am doing an overlap check on each particle independently and exit when a first overlap is found. It looks like this:
// We independently check for each particle
std::atomic<bool> overlapFound = false;
#pragma omp parallel for num_threads(8 or 16)
for (std::size_t i = 0; i < N; i++) {
    if (overlapFound)
        continue;
    if (isParticleOverlappingAnything(i))
        overlapFound = true;
}
Now, in parallel regions I don't allocate any new memory and don't need any critical sections - there should be no race conditions.
Moreover, all memory management in the whole program is done in a RAII fashion by std::vector, std::unique_ptr, etc. - I don't use new or delete anywhere.
Investigation
I tried some Valgrind tools. I ran a simulation for a time which produces about 16 MB of (still increasing) memory consumption in the non-matching thread numbers case, while it stays at 8 MB in the matching case.
Valgrind Memcheck does not show any memory leaks (only a couple of kB "still reachable" or "possibly lost" from OpenMP control structures, see here) in either case.
Valgrind Massif reports only those "correct" 8 MB of allocated memory in both cases.
I also tried to surround the contents of main in { } and add while(true):
int main() {
    {
        // Do the simulation and let RAII do all the cleanup when destructors are called
    }
    // Hang
    while (true) { }
}
During the simulation memory consumption increases, let's say up to 100 MB. When { ... } ends its execution, memory consumption drops by around 6 MB and stays at 94 MB in while(true). 6 MB is the actual size of the biggest data structures (I estimated it), but the remaining part is of unknown origin.
Hypothesis
So I assume it must be something to do with OpenMP memory management. Maybe using 8 and 16 threads alternately causes OpenMP to constantly create new thread pools, abandoning old ones without releasing resources? I found something like this here, but it seems to concern another OpenMP implementation.
I would be very grateful for ideas about what else I can check and where the issue might be.
re #1201ProgramAlarm: I have changed volatile to std::atomic
re #Gilles: I have checked the 16-thread case for "particle moves" and updated accordingly
Minimal example
I was finally able to reproduce the issue in a minimal example; it ended up being extremely simple and all the details here are unnecessary. I created a new question without all the mess here.
Where lies the problem?
It seems that the problem is not connected with what this particular code does or how the OpenMP clauses are structured, but solely with two alternating OpenMP parallel regions with different numbers of threads. After millions of those alternations there is a substantial amount of memory used by the process, regardless of what is in the sections. They may be as simple as sleeping for a couple of milliseconds.
As this question contains too many unnecessary details, I have moved the discussion to a more direct question here and refer the interested reader there.
A brief summary of what happens
Here I give a brief summary of what StackOverflow members and I were able to determine. Let's say we have 2 OpenMP sections with different numbers of threads, such as here:
#include <unistd.h>

int main() {
    while (true) {
        #pragma omp parallel num_threads(16)
        usleep(30);

        #pragma omp parallel num_threads(8)
        usleep(30);
    }
    return 0;
}
As described in more detail here, OpenMP reuses the common 8 threads, but the other 8 needed for the 16-thread section are constantly created and destroyed. This constant thread creation causes the increasing memory consumption, either because of an actual memory leak or because of memory fragmentation, I don't know which. Moreover, the problem seems to be specific to the GOMP OpenMP implementation in GCC (up to at least version 10); the Clang and Intel compilers do not seem to replicate the issue.
Although not stated explicitly by the OpenMP standard, most implementations tend to reuse the already spawned threads, but this seems not to be the case for GOMP and it is probably a bug. I will file a bug report and update the answer. For now, the only workaround is to use the same number of threads in every parallel region (then GOMP properly reuses the old threads). In cases like the collapse loop from the question, where there are fewer threads to distribute than in the other section, one can always request 16 threads instead of 8 and let the other 8 just do nothing, as sketched below. That worked quite well in my "production" code.
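Applied to the toy example above, the workaround just means requesting the same team size in both regions; a minimal sketch:
#include <unistd.h>

int main() {
    while (true) {
        // Same team size in both regions: GOMP keeps reusing one pool of 16
        // threads instead of repeatedly spawning and destroying 8 extra ones.
        #pragma omp parallel num_threads(16)
        usleep(30);

        #pragma omp parallel num_threads(16)
        usleep(30);
    }
    return 0;
}
In the collapse(3) loop from the question only 8 of those 16 threads receive iterations; the remaining ones simply wait at the implicit barrier.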

Performance issues of multiple independent for loops with OpenMP

I am planning to use OpenMP threads for an intensive computation. However, I couldn't achieve the performance I expected on the first try. I think there are several issues, but I am not sure yet. Generally, I suspect the performance bottleneck is the fork and join model. Can you help me in some way?
First, in a routine cycle running on a consumer thread, there are 2 independent for loops and some additional functions. The functions are located at the end of the routine cycle and between the for loops, as seen below:
void routineFunction(short* xs, float* xf, float* yf, float* h)
{
    // Casting
    #pragma omp parallel for
    for (int n = 0; n < 1024*1024; n++)
    {
        xf[n] = (float)xs[n];
    }

    memset(yf, 0, 1024*1024*sizeof(float));

    // Filtering
    #pragma omp parallel for
    for (int n = 0; n < 1024*1024-1024; n++)
    {
        for (int nn = 0; nn < 1024; nn++)
        {
            yf[n] += xf[n+nn]*h[nn];
        }
    }

    status = DftiComputeBackward(hand, yf, yf); // Compute backward transform
}
Note: This code cannot be compiled, because I simplified it for readability by clearing out details.
The OpenMP thread number is set to 8 dynamically. I observed the threads in use in the Windows taskbar. While the thread count increased significantly, I didn't observe any performance improvement. I have some guesses, but I still want to discuss them with you for further implementations.
My questions are these.
Does the fork and join model correspond to thread creation and destruction? Does it have the same cost for the software?
Once routineFunction is called by the consumer, do the OpenMP threads fork and join every time?
During the run of routineFunction, do the OpenMP threads fork and join at each for loop, or does the compiler handle the second loop by reusing the existing threads? If the for loops cause a fork and join two times, how should I rearrange the code? Is combining the two loops into a single loop sensible for saving performance, or is using a parallel region (#pragma omp parallel) with #pragma omp for (not #pragma omp parallel for) the better choice for sharing the work? My concern is that this forces me into static scheduling by using thread IDs and thread counts. According to the document at page 34, static scheduling can cause load imbalance. Actually, I am familiar with static scheduling because of CUDA programming, but I still want to avoid it if there is any performance issue. I also read an answer on Stack Overflow which points out, in its last paragraph, that smart OpenMP implementations do not join the master thread after a parallel region is completed (written by Alexey Kukanov). How can I utilize the busy-wait and sleep attributes of OpenMP to avoid joining the master thread after the first loop is completed?
Is there another reason for the performance issue in the code?
This is mostly memory-bound code. Its performance and scalability are limited by the amount of data the memory channel can transfer per unit time. xf and yf take 8 MiB in total, which fits in the L3 cache of most server-grade CPUs but not of most desktop or laptop CPUs. If two or three threads are already able to saturate the memory bandwidth, adding more threads is not going to bring additional performance. Also, casting short to float is a relatively expensive operation - 4 to 5 cycles on modern CPUs.
Does the fork and join model correspond to thread creation and destruction? Does it have the same cost for the software?
Once routineFunction is called by the consumer, do the OpenMP threads fork and join every time?
No; basically all OpenMP runtimes, including that of MSVC++, implement parallel regions using thread pools, as this is the easiest way to satisfy the requirement of the OpenMP specification that thread-private variables retain their values between different parallel regions. Only the very first parallel region suffers the full cost of starting new threads. Subsequent regions reuse those threads, and an additional price is paid only if more threads are needed than in any of the previously executed parallel regions. There is still some overhead, but it is far lower than that of starting new threads each time.
During the run of routineFunction, do the OpenMP threads fork and join at each for loop, or does the compiler handle the second loop by reusing the existing threads?
Yes, in your case two separate parallel regions are created. You can manually merge them into one:
#pragma omp parallel
{
    #pragma omp for
    for (int n = 0; n < 1024*1024; n++)
    {
        xf[n] = (float)xs[n];
    }

    #pragma omp single
    {
        memset(yf, 0, 1024*1024*sizeof(float));
        //
        // Other code that was between the two parallel regions
        //
    }

    // Filtering
    #pragma omp for
    for (int n = 0; n < 1024*1024-1024; n++)
    {
        for (int nn = 0; nn < 1024; nn++)
        {
            yf[n] += xf[n+nn]*h[nn];
        }
    }
}
Is there another reason for the performance issue in the code?
It is memory-bound, or at least the two loops shown here are.
Alright, it's been a while since I did OpenMP stuff so hopefully I didn't mess any of this up... but here goes.
Forking and joining is the same thing as creating and destroying threads. How the cost compares to other threads (such as a C++11 thread) will be implementation dependent. I believe in general OpenMP threads might be slightly lighter-weight than C++11 threads, but I'm not 100% sure about that. You'd have to do some testing.
Currently, each time routineFunction is called you will fork for the first for loop, join, do a memset, fork for the second loop, join, and then call DftiComputeBackward.
You would be better off creating a parallel region as you stated. Not sure why the scheduling is an extra concern. It should be as easy as moving your memset to the top of the function, starting a parallel region using your noted command, and making sure each for loop is marked with #pragma omp for as you mentioned. You may need to put an explicit #pragma omp barrier in between the two for loops to make sure all threads finish the first for loop before starting the second... OpenMP has some implicit barriers but I forgot if #pragma omp for has one or not.
Make sure that the OpenMP compile flag is turned on for your compiler. If it isn't, the pragmas will be ignored, it will compile, and nothing will be different.
Your operations are prime for SIMD acceleration. You might want to see if your compiler supports auto-vectorization and if it is doing it. If not, I'd look into SIMD a bit, perhaps using intrinsics.
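For instance, the casting loop from the question can be given an explicit vectorization hint with OpenMP's simd construct (OpenMP 4.0 and later); a sketch, with a signature that mirrors the question's arrays:
// Combine worksharing across threads with SIMD vectorization of the cast.
void castToFloat(const short* xs, float* xf, int count)
{
    #pragma omp parallel for simd
    for (int n = 0; n < count; n++)
        xf[n] = (float)xs[n];
}
Whether this beats plain auto-vectorization depends on the compiler; it mainly documents the intent explicitly.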
How much time does DftiComputeBackward take relative to this code?

Performance problems using OpenMP in nested loops

I'm using the following code, which contains an OpenMP parallel for loop nested in another for loop. Somehow the performance of this code is 4 times slower than the sequential version (omitting #pragma omp parallel for).
Is it possible that OpenMP has to create threads every time the method is called? In my test it is called 10000 times directly after each other.
I heard that sometimes OpenMP will keep the threads spinning. I also tried setting OMP_WAIT_POLICY=active and GOMP_SPINCOUNT=INFINITE. When I remove the OpenMP pragmas, the code is about 10 times faster. Note that the method containing this code will be called 10000 times.
for (round k = 1; k < processor.max; ++k) {
    initialise_round(k);

    for (std::vector<int> bucket : color_buckets) {
        #pragma omp parallel for schedule(dynamic)
        for (int i = 0; i < bucket.size(); ++i) {
            if (processor.mark.is_marked_item(bucket[i])) {
                processor.process(k, bucket[i]);
            }
        }
        processor.finish_round(k);
    }
}
You say that your sequential code is much faster, so this makes me think that your processor.process function involves too few instructions and takes too little time. This leads to the case where passing the data to each thread does not pay off (the data exchange overhead is simply larger than the actual computation on that thread).
Other than that, I think that parallelizing the middle loop won't affect the algorithm but will increase the amount of work per thread.
I think you are creating a team of threads on each iteration of the loop... (although I'm not sure what for alone does - I thought it should be parallel for). In this case, it would probably be better to separate the parallel from the for, so the work of forking and creating the threads is done just once rather than being repeated in the other loops. So you could try to put a parallel pragma before your outermost loop so the overhead of forking and thread creation is done just once, for example as in the sketch below.
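A sketch of that idea, using the names from the question (it assumes initialise_round and finish_round must run on a single thread, which is why they are wrapped in single):
#pragma omp parallel            // one team of threads for the whole computation
{
    for (round k = 1; k < processor.max; ++k) {
        #pragma omp single
        initialise_round(k);    // executed by one thread; implicit barrier after single

        for (const std::vector<int>& bucket : color_buckets) {
            #pragma omp for schedule(dynamic)   // worksharing only, no new threads
            for (int i = 0; i < (int)bucket.size(); ++i) {
                if (processor.mark.is_marked_item(bucket[i]))
                    processor.process(k, bucket[i]);
            }                                   // implicit barrier at the end of the for

            #pragma omp single
            processor.finish_round(k);          // once per bucket, as in the original
        }
    }
}
Every thread runs the cheap outer control flow, but the real work is divided up only inside the omp for loop, and the team is created a single time.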
The actual problem was not related to OpenMP directly.
As the system has two CPUs, half of the threads were spawned on one and the other half on the other CPU. Therefore there was no shared L3 cache. Combined with the fact that the algorithm doesn't scale well, this led to a performance decrease, especially when using 2-4 threads.
The solution was to use thread pinning, for example via the Linux tool taskset.

What is the best way to parallelise tasks sharing an object but otherwise independent?

I'm coding a physics simulation consisting mainly of a central loop of hundreds of billions of repetitions of operations on an array. These operations are independent of each other (well, actually the array changes along the way), so I'm thinking about parallelising my code, as I can make it run on the 4- or 8-core computers in my lab.
It's my first time doing something like this and I've been advised to look at OpenMP. I've started to write some toy programs with it, but I'm really unsure about how it works and the documentation is quite cryptic to me. For example, the following code:
int a = 0;
#pragma omp parallel
{
    a++;
}
cout << a << endl;
launched on my computer (4-core CPU) sometimes gives me 4, other times 3 or 2. Is it because it doesn't wait for all the cores to execute the instructions? I definitely need to know how many iterations were done in my case. Should I look for something other than OpenMP, considering what I want in the end?
When writing concurrently to a shared variable (a in your code), you have a data race. To avoid different threads writing "simultaneously", you must either use an atomic assignment or protect the assignment with a mutex (= mutual exclusion). In OpenMP, the latter is done via a critical region:
int a = 0;
#pragma omp parallel
{
    #pragma omp critical
    {
        a++;
    }
}
cout << a << endl;
(of course, this particular program does nothing in parallel, hence will be slower than a serial one doing the same).
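For completeness, the atomic alternative mentioned above would look like this (a minimal sketch):
#include <iostream>

int main()
{
    int a = 0;
    #pragma omp parallel
    {
        // Atomic update: no increment is lost, and it is typically cheaper
        // than a critical section for a single scalar operation.
        #pragma omp atomic
        a++;
    }
    std::cout << a << std::endl;   // prints the number of threads in the team
    return 0;
}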
For more info, read the OpenMP documentation! However, I would advise you not to use OpenMP but TBB if you're using C++. It's much more flexible.
What you are seeing is a typical example of a race condition. Four threads are trying to increment variable a and they are fighting for it. Some "lose" and are not able to increment it, so you see a result lower than 4.
What happens is that the a++ command is actually a set of three instructions: read a from memory and put it in a register, increment the value in the register, then put the value back in memory. If thread 1 reads the value of a after thread 2 has read it but before thread 2 has written the new value back to a, the increment operation of thread 2 will be overwritten. Using #pragma omp critical is a way to ensure that the read/increment/write sequence is not interrupted by another thread.
If you need to parallelize iterations, you can use omp parallel for, for instance to increment all the elements in an array.
Typical use:
#pragma omp parallel for
for (i = 0; i < N; i++)
    a[i]++;

Concurrency and optimization using OpenMP

I'm learning OpenMP. To do so, I'm trying to parallelize some existing code. But I seem to get a worse time when using OpenMP than when I don't.
My inner loop:
#pragma omp parallel for
for (unsigned long j = 0; j < c_numberOfElements; ++j)
{
    //int th_id = omp_get_thread_num();
    //printf("thread %d, j = %d\n", th_id, (int)j);

    Point3D current;
    #pragma omp critical
    {
        current = _points[j];
    }

    Point3D next = getNext(current);
    if (!hasConstraint(next))
    {
        continue;
    }

    #pragma omp critical
    {
        _points[j] = next;
    }
}
_points is a pointMap_t, defined as:
typedef boost::unordered_map<unsigned long, Point3D> pointMap_t;
Without OpenMP my running time is 44.904s. With OpenMP enabled, on a computer with two cores, it is 64.224s. What I am doing wrong?
Why have you wrapped your reads and writes to _points[j] in critical sections? I'm not much of a C++ programmer, but it doesn't look to me as if you need those sections at all. As you've written it (unnamed critical sections), each thread is going to wait while another goes through each of the sections. This could easily make the program slower.
It seems possible that the lookup and write to _points in critical sections is dragging down the performance when you use OpenMP. Single-threaded, this will not result in any contention.
Sharing seed data like this seems counterproductive in a parallel programming context. Can you restructure to avoid these contention points?
You need to show the rest of the code. From a comment to another answer, it seems you are using a map. That is really a bad idea, especially if you are mapping 0..n numbers to values: why don't you use an array?
If you really need to use containers, consider using the ones from Intel's Threading Building Blocks library.
I agree that it would be best to see some working code.
The ultimate issue here is that there are criticals within a parallel region, and criticals are (a) enormously expensive in and of themselves, and (b) by definition, kill parallelism. The assignment to current certainly doesn't need to be inside a critical, as it is private; I wouldn't have thought the _points[j] assignment would be, either, but I don't know what the map stuff does, so there you go.
But you have a loop in which you have a huge amount of overhead, which grows linearly in the number of threads (the two critical regions) in order to do a tiny amount of actual work (walk along a linked list, it looks like). That's never going to be a good trade...
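To make the suggestions above concrete, here is a sketch of the loop with _points stored in a plain std::vector<Point3D> instead of an unordered_map, which removes the need for any critical sections. Point3D, getNext and hasConstraint are the question's types and functions; updatePoints is just an illustrative name.
#include <vector>

void updatePoints(std::vector<Point3D>& points)
{
    #pragma omp parallel for
    for (long j = 0; j < (long)points.size(); ++j)
    {
        Point3D current = points[j];     // private copy, no lock needed
        Point3D next = getNext(current);
        if (hasConstraint(next))
            points[j] = next;            // each iteration writes a distinct element
    }
}
Whether this helps in practice still depends on how much work getNext does per element, as the other answers point out.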