Analyzing my C++ code with DRD (a Valgrind tool) reports a "Conflicting load", but I cannot see why. The code is as follows:
int* x;
int Nt = 2;
x = new int[Nt];
omp_set_num_threads(Nt);
#pragma omp parallel for
for (int i = 0; i < Nt; i++)
{
    x[i] = i;
}
for (int i = 0; i < Nt; i++)
{
    printf("%d\n", x[i]);
}
The program behaves well, but DRD sees an issue when the master thread prints out the value of x[1]. Apart from possible false sharing due to how the x array is allocated, I do not see why there should be any conflict, or how to avoid it... Any insights, please?
EDIT Here's the DRD output for the above code (line 47 corresponds to the printf statement):
==2369== Conflicting load by thread 1 at 0x06031034 size 4
==2369== at 0x4008AB: main (test.c:47)
==2369== Address 0x6031034 is at offset 4 from 0x6031030. Allocation context:
==2369== at 0x4C2DCC7: operator new[](unsigned long) (vg_replace_malloc.c:363)
==2369== by 0x400843: main (test.c:37)
==2369== Other segment start (thread 2)
==2369== at 0x4C31EB8: pthread_mutex_unlock (drd_pthread_intercepts.c:703)
==2369== by 0x4C2F00E: vgDrd_thread_wrapper (drd_pthread_intercepts.c:236)
==2369== by 0x5868D95: start_thread (in /lib64/libpthread-2.15.so)
==2369== by 0x5B6950C: clone (in /lib64/libc-2.15.so)
==2369== Other segment end (thread 2)
==2369== at 0x5446846: ??? (in /usr/lib64/gcc/x86_64-pc-linux-gnu/4.7.3/libgomp.so.1.0.0)
==2369== by 0x54450DD: ??? (in /usr/lib64/gcc/x86_64-pc-linux-gnu/4.7.3/libgomp.so.1.0.0)
==2369== by 0x4C2F014: vgDrd_thread_wrapper (drd_pthread_intercepts.c:355)
==2369== by 0x5868D95: start_thread (in /lib64/libpthread-2.15.so)
==2369== by 0x5B6950C: clone (in /lib64/libc-2.15.so)
GNU OpenMP runtime (libgomp) implements OpenMP thread teams using a pool of threads. After they are created, the threads sit docked at a barrier where they wait to be awakened to perform a specific task. In GCC these tasks come in the form of outlined (the opposite of inlined) code segments, i.e. the code for the parallel region (or for the explicit OpenMP task) is extracted into a separate function, which is then supplied to some of the waiting threads as a task for execution. The docking barrier is lifted and the threads start executing the task. Once that is finished, the threads are docked again - they are not joined, but simply put on hold. Therefore, from DRD's perspective, the master thread, which executes the serial part of the code after the parallel region, is accessing without protection resources that might be written to by the other threads. This of course cannot happen, since the other threads are docked and waiting for a new task.
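To make the outlining concrete, here is a rough sketch of what GCC conceptually produces for the parallel region above. The GOMP_parallel_start/GOMP_parallel_end entry points are the real libgomp calls emitted by GCC 4.x, but the function name, data layout and the body shown here are simplified assumptions, not actual compiler output:

// The parallel region body, extracted ("outlined") into its own function.
struct omp_data { int* x; int Nt; };

static void main_omp_fn_0(void* arg)
{
    omp_data* d = static_cast<omp_data*>(arg);
    // each team thread executes its share of the iterations:
    //     d->x[i] = i;
}

// What the #pragma in main() roughly turns into:
//     GOMP_parallel_start(main_omp_fn_0, &data, Nt); // wake the docked threads
//     main_omp_fn_0(&data);                          // master runs its share too
//     GOMP_parallel_end();                           // dock the threads again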
Such false positives are common with general tools like DRD that do not understand the specific semantics of OpenMP, which makes them unsuitable for analysing OpenMP programs. You should instead use a specialised tool, e.g. the free Thread Analyzer from Sun/Oracle Solaris Studio for Linux or the commercial Intel Inspector. The latter can be used for free with a license for non-commercial development purposes. Both tools understand the specifics of OpenMP and won't present such situations as possible data races.
I am using Nsight Eclipse Edition 10.2 to debug plain C++ code with gdb 7.11.1. The code uses an OpenMP pragma to parallelize a for loop. The following is a minimal working example, in which a simple array q is filled with values of the loop variable p:
#pragma omp parallel for schedule(static)
for (int p = pstart; p < pend; p++) {
    const unsigned i = id[p];
    if (start <= i && i < end)
        q[i - start] = p;
}
In debug mode I would want to use the step-in function (classically F5) to follow how the array q gets filled in with p's. However, that steps over the for loop altogether and resumes where the parallel threads join again.
Is there a way to force stepping into a pragma directive/openMP loop?
That will depend on the debugger, but it's also not entirely clear what stepping in would mean. Since many threads execute the parallel loop, would you expect each of them to stop and then step together? How do you expect to show the different state of each thread (each will have its own p and i)? What happens if the threads' control flow diverges?
There are debuggers which can do some of this (such as TotalView on Linux), but it's not trivial to do (and TotalView costs money [which is entirely fair and reasonable :-)]).
What you may need to do is set a breakpoint inside the loop, and then handle it being hit by N threads...
(Which doesn't answer your precise question, but does let you see what's going on in the loop, which is possibly what you really need to do!)
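For example, with plain gdb the breakpoint approach could look like this (the file name and line number are hypothetical; set scheduler-locking on keeps the other threads frozen while you step the selected one):

(gdb) break example.cpp:42        # a line inside the parallel loop body
(gdb) run
(gdb) info threads                # see which team threads hit the breakpoint
(gdb) thread 3                    # switch to one of the worker threads
(gdb) set scheduler-locking on    # freeze all threads except the current one
(gdb) next                        # step this thread through the loop
(gdb) print p
(gdb) print i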
I wrote the classic game "Life" with 4-sided neighbors. When I run it in debug mode, it says:
Consecutive version: 4.2s
Parallel version: 1.5s
Okay, that's good. But if I run it in release mode, it says:
Consecutive version: 0.46s
Parallel version: 1.23s
Why? I run it on a computer with 4 cores and I run 4 threads in the parallel section. The answer is correct, but somewhere there is a performance leak and I don't know where. Can anybody help me?
I tried running it in Visual Studio 2008 and 2012; the results are the same. OpenMP is enabled in the project settings.
To reproduce my problem, find the defined constant PARALLEL and set it to 1 or 0 to enable or disable OpenMP respectively. The answer will be in out.txt (out.txt is an example of the right answer). The input must be in in.txt (in.txt is my input). There are some Russian symbols that you don't need to understand, but the first number in in.txt is the number of threads to run in the parallel section (it is 4 in the example).
The main part is in the StartSimulation function. If you run the program, you will see some Russian text with the running time in the console.
The program code is rather big, so I have added it via file hosting: main.cpp ("l2" means "lab 2" for me).
Some comments about the StartSimulation function: it cuts the 2D surface of cells into small rectangles, which is done by the AdjustKernelsParameters function.
I do not find the ratio so strange. Having multiple threads co-operate is a complex business and has overheads.
Access to shared memory needs to be serialized, which normally involves some form of locking mechanism and contention between threads as they wait for the lock to be released.
Such shared variables need to be synchronized between the processor cores, which can cause significant slowdowns. Also, the compiler needs to treat these critical areas differently, as a "sequence point".
All this reduces the scope for per thread optimization both in the processor hardware and the compiler for each thread when it is working with the shared variable.
It seems that in this case the overheads of parallelization outweigh the optimization possibilities for the single threaded case.
If each thread had more work to do independently before needing to access a shared variable, these overheads would be less significant.
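As a minimal sketch of the difference (the counter and the alive() predicate are hypothetical stand-ins, not taken from the asker's code; the answer below applies the same idea to the real code): the first loop serializes every increment on the shared counter, while the second lets each thread count privately and merges only once at the end.

// Hypothetical stand-in for the per-cell test.
static bool alive(int /*i*/) { return true; }

long countAlive(int n)
{
    long count = 0;

    // Slow: every increment synchronizes on the single shared counter.
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
    {
        if (alive(i))
        {
            #pragma omp atomic
            ++count;
        }
    }

    // Faster: each thread accumulates a private copy, combined once
    // when the loop ends.
    count = 0;
    #pragma omp parallel for reduction(+:count)
    for (int i = 0; i < n; i++)
    {
        if (alive(i))
        {
            ++count;
        }
    }
    return count;
}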
You are using the guided loop schedule. This is a very bad choice given that you are dealing with a regular problem, where each task can easily do exactly the same amount of work as any other if the domain is simply divided into chunks of equal size.
Replace schedule(guided) with schedule(static). Also employ a sum reduction over livingCount instead of using locked increments:
#if PARALLEL == 1
#pragma omp parallel for schedule(static) num_threads(kernelsCount) \
                         reduction(+:livingCount)
#endif
for (int offsetI = 0; offsetI < n; offsetI += kernelPartSizeN)
{
    for (int offsetJ = 0; offsetJ < m; offsetJ += kernelPartSizeM)
    {
        int boundsN = min(kernelPartSizeN, n - offsetI),
            boundsM = min(kernelPartSizeM, m - offsetJ);
        for (int kernelOffsetI = 0; kernelOffsetI < boundsN; ++kernelOffsetI)
        {
            for (int kernelOffsetJ = 0; kernelOffsetJ < boundsM; ++kernelOffsetJ)
            {
                if (BirthCell(offsetI + kernelOffsetI, offsetJ + kernelOffsetJ))
                {
                    ++livingCount;
                }
            }
        }
    }
}
In C/C++, how can I make threads (POSIX pthreads/Windows threads) give me a safe way to report progress back to the main thread on the execution of the work I have decided to perform in the thread?
Is it possible to report the progress as a percentage?
I'm going to assume a very simple case of a main thread and one function. What I'd recommend is passing in a pointer to an atomic (as suggested by Kirill above) each time you launch a thread. Assuming C++11 here.
#include <atomic>
#include <chrono>
#include <thread>
#include <vector>
using namespace std;

void threadedFunction(atomic<int>* progress)
{
    for (int i = 0; i < 100; i++)
    {
        progress->store(i); // updates the variable safely
        chrono::milliseconds dura(2000);
        this_thread::sleep_for(dura); // sleeps for a bit
    }
}

int main(int argc, char** argv)
{
    // Make and launch 10 threads. Note: std::atomic is neither copyable nor
    // movable, so the vector is sized up front instead of grown with
    // emplace_back; this also keeps the pointers handed to the threads stable.
    vector<atomic<int>> atomics(10);
    vector<thread> threads;
    for (int i = 0; i < 10; i++)
    {
        threads.emplace_back(threadedFunction, &atomics[i]);
    }

    // Monitor the threads down here
    // use atomics[n].load() to get the value from the atomics

    // Join before returning: destroying a joinable std::thread calls
    // std::terminate.
    for (auto& t : threads)
    {
        t.join();
    }
    return 0;
}
I think that'll do what you want. I omitted polling the threads, but you get the idea. I'm passing in an object that both the main thread and the child thread know about (in this case the atomic<int> variable) that they both can update and/or poll for results. If your compiler doesn't have full C++11 thread/atomic support, use whatever your platform provides, but there's always a way to pass a variable (at the least a void*) into the thread function, and that's how you pass information back and forth without statics.
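For completeness, the omitted polling could look something like this sketch, slotted in at the "Monitor the threads" comment (the 99% threshold and the printing are placeholder choices, not part of the original answer):

// Hypothetical monitoring loop: repeatedly read each thread's progress.
// Needs #include <cstdio> for printf.
bool done = false;
while (!done)
{
    done = true;
    for (size_t n = 0; n < atomics.size(); n++)
    {
        int pct = atomics[n].load();      // safe concurrent read
        printf("thread %zu: %d%%\n", n, pct);
        if (pct < 99)
            done = false;                 // at least one thread still working
    }
    this_thread::sleep_for(chrono::seconds(1));
}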
The best way to solve this is to use C++ atomics. Declare one in some sufficiently visible place:
std::atomic<int> my_thread_progress(0);
In a simple case this can be a static variable; in a more complex design it should be a data member of some object that manages the threads, or something similar.
On many platforms this may look slightly paranoid, because almost everywhere the read and write operations on integers are already atomic. But using atomics still makes sense because:
You have a guarantee that this will work fine on any platform, even on a 16-bit CPU or other unusual hardware;
Your code will be easier to read. The reader will immediately see that this is a shared variable, without any comments. And since it is updated with load/store methods, it is easier to follow what is going on.
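For instance, a trivial sketch (the two function names are just placeholders):

#include <atomic>

std::atomic<int> my_thread_progress(0);

void worker_step(int pct)
{
    my_thread_progress.store(pct);    // explicitly atomic write
}

int poll_progress()
{
    return my_thread_progress.load(); // explicitly atomic read
}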
EDIT
Intel® 64 and IA-32 Architectures Software Developer’s Manual
Combined Volumes: 1, 2A, 2B, 2C, 3A, 3B and 3C (http://download.intel.com/products/processor/manual/325462.pdf)
Volume 3A: 8.1.1 Guaranteed Atomic Operations
The Intel486 processor (and newer processors since) guarantees that the following basic memory operations will always be carried out atomically:
Reading or writing a byte
Reading or writing a word aligned on a 16-bit boundary
Reading or writing a doubleword aligned on a 32-bit boundary
I have been using boost threads on 32-bit Linux for some time and am very happy with their performance so far. Recently the project was moved to a 64-bit platform and we saw a huge increase in memory usage (from about 2.5 GB to 16-17 GB). I have done profiling and found that the boost threads are the source of the huge allocation. Each thread is allocating about 10x what it was on 32-bit.
I profiled using valgrind's massif and have confirmed the issue using only boost threads in a separate test application. I also tried using std::thread instead, and those do not exhibit the large memory allocation issue.
I am wondering if anyone else has seen this behaviour and knows what the problem is? Thanks.
There's no problem. This is virtual memory, and each 64-bit process can allocate terabytes of virtual memory on every modern operating system. It's basically free and there's no reason to care about how much of it is used.
It's basically just reserved space for thread stacks. You can reduce it, if you want, by changing the default stack size. But there's absolutely no reason to.
1. Per-thread stack size
Use pthread_attr_getstacksize to view it, and boost::thread::attributes (pthread_attr_setstacksize underneath) to change it.
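A minimal sketch of shrinking the per-thread stack with Boost.Thread attributes (the 1 MB value is just an example):

#include <boost/thread.hpp>

void worker()
{
    // ... thread work ...
}

int main()
{
    boost::thread::attributes attrs;
    attrs.set_stack_size(1024 * 1024); // request a 1 MB stack for this thread
    boost::thread t(attrs, worker);
    t.join();
}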
2. Per-thread pre-mmap in glibc's malloc
A gdb backtrace from a boost.thread example:
0 0x000000000040ffe0 in boost::detail::get_once_per_thread_epoch() ()
1 0x0000000000407c12 in void boost::call_once<void (*)()>(boost::once_flag&, void (*)()) [clone .constprop.120] ()
2 0x00000000004082cf in thread_proxy ()
3 0x000000000041120a in start_thread (arg=0x7ffff7ffd700) at pthread_create.c:308
4 0x00000000004c5cf9 in clone ()
5 0x0000000000000000 in ?? ()
You will discover data = malloc(sizeof(boost::uintmax_t)); in get_once_per_thread_epoch (boost_1_50_0/libs/thread/src/pthread/once.cpp).
continue
1 0x000000000041a0d3 in new_heap ()
2 0x000000000041b045 in arena_get2.isra.5.part.6 ()
3 0x000000000041ed13 in malloc ()
4 0x0000000000401b1a in test () at pthread_malloc_8byte.cc:9
5 0x0000000000402d3a in start_thread (arg=0x7ffff7ffd700) at pthread_create.c:308
6 0x00000000004413d9 in clone ()
7 0x0000000000000000 in ?? ()
In the new_heap function (glibc-2.15\malloc\arena.c), glibc pre-mmaps 64 MB of memory per thread on a 64-bit OS. In other words, each thread will use 64 MB + 8 MB (default thread stack) = 72 MB.
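If the per-thread arenas are the concern, glibc lets you cap their number, either with the MALLOC_ARENA_MAX environment variable or programmatically (a sketch; the cap of 1 is just an example value):

#include <malloc.h> // glibc-specific

int main()
{
    // Limit glibc malloc to a single arena instead of pre-mmapping a
    // 64 MB heap per thread (equivalent to setting MALLOC_ARENA_MAX=1).
    mallopt(M_ARENA_MAX, 1);
    // ... create threads as usual ...
}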
glibc-2.15\ChangeLog.17
2009-03-13 Ulrich Drepper <drepper@redhat.com>
* malloc/malloc.c: Implement PER_THREAD and ATOMIC_FASTBINS features.
* malloc/arena.c: Likewise.
* malloc/hooks.c: Likewise.
http://wuerping.github.io/blog/malloc_per_thread.html
So, I'm implementing a program with multiple threads (pthreads), and I am looking for help on a few points. I'm using C++ on Linux. All of my other questions have been answered by Google so far, but there are still two that I have not found answers for.
Question 1: I am going to be doing a bit of file I/O and web-page fetching/processing within my threads. Is there any way to guarantee that what the threads do is atomic? I am more than likely going to let my program run for quite a while, and it won't really have a predetermined ending point. I am going to catch the signal from a ctrl+c, and I want to do some cleanup afterwards while still having my program print out results, close files, etc.
I'm just wondering if it is reasonable behavior for the program to wait for the threads to complete or if I should just kill all the threads/close the file and exit. I just don't want my results to be skewed. Should I/can I just do a pthread_exit() in the signal catching method?
Any other comments/ideas on this would be nice.
Question 2: Valgrind is saying that I have some possible memory leaks. Are these avoidable, or does this always happen with threading in C++? Below are two of the six or so messages that I get when checking with valgrind.
I have been looking at a number of different websites, and one said that some possible memory leaks could be caused by putting a thread to sleep. This doesn't make sense to me; nevertheless, I am currently sleeping the threads to test the setup I have right now (I'm not actually doing any real I/O at the moment, just playing with threads).
==14072== 256 bytes in 1 blocks are still reachable in loss record 4 of 6
==14072== at 0x402732C: calloc (vg_replace_malloc.c:467)
==14072== by 0x400FDAC: _dl_check_map_versions (dl-version.c:300)
==14072== by 0x4012898: dl_open_worker (dl-open.c:269)
==14072== by 0x400E63E: _dl_catch_error (dl-error.c:178)
==14072== by 0x4172C51: do_dlopen (dl-libc.c:86)
==14072== by 0x4052D30: start_thread (pthread_create.c:304)
==14072== by 0x413A0CD: clone (clone.S:130)
==14072==
==14072== 630 bytes in 1 blocks are still reachable in loss record 5 of 6
==14072== at 0x402732C: calloc (vg_replace_malloc.c:467)
==14072== by 0x400A8AF: _dl_new_object (dl-object.c:77)
==14072== by 0x4006067: _dl_map_object_from_fd (dl-load.c:957)
==14072== by 0x4007EBC: _dl_map_object (dl-load.c:2250)
==14072== by 0x40124EF: dl_open_worker (dl-open.c:226)
==14072== by 0x400E63E: _dl_catch_error (dl-error.c:178)
==14072== by 0x4172C51: do_dlopen (dl-libc.c:86)
==14072== by 0x4052D30: start_thread (pthread_create.c:304)
==14072== by 0x413A0CD: clone (clone.S:130)
I am creating my threads with:
rc = pthread_create(&threads[t], NULL, thread_stall, (void *)NULL);
(rc = return code). At the end of the entry point, I call pthread_exit().
Here's my take:
1. If you want your threads to exit gracefully (killing them with open file or socket handles is never a good idea), have them loop on a termination flag:
while (!stop)
{
    // do work
}
Then, when you catch the ctrl-c, set the flag to true and join the threads. Make sure to declare stop as std::atomic<bool> so that all the threads see the updated value. This way they will finish their current batch of work and then exit gracefully the next time they check the condition.
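A compact sketch of the whole pattern (shown with std::thread for brevity; the same structure works with pthread_create/pthread_join, and the thread count and worker body are placeholders):

#include <atomic>
#include <csignal>
#include <thread>
#include <vector>

std::atomic<bool> stop(false);

extern "C" void onSigint(int) // the ctrl-c handler only sets the flag
{
    stop = true;
}

void worker()
{
    while (!stop)
    {
        // do one batch of work, then re-check the flag
    }
    // flush and close this thread's files here before returning
}

int main()
{
    std::signal(SIGINT, onSigint);

    std::vector<std::thread> threads;
    for (int i = 0; i < 4; i++) // 4 threads is an arbitrary example
        threads.emplace_back(worker);

    for (auto& t : threads)
        t.join(); // wait for each thread to finish its cleanup

    // print results / close remaining files here
}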
2. I don't have enough information about your code to answer this.