I tested a simple hello-world program on my Mac in C++ with the OpenMP library, from the terminal, using the following two commands:
/usr/local/bin/clang++-omp -fopenmp helloworld.cpp -o test
/usr/local/bin/valgrind --tool=helgrind --log-file=a.log ./test
The output is correct:
warning: no debug symbols in executable (-arch x86_64)
Hello World from thread = 0
Hello World from thread = 1
Hello World from thread = 3
Hello World from thread = 2
Number of threads = 4
but the log file (a.log) contains "174986 errors from 231 contexts" in its error summary.
Here's a part of the log file:
==643== ---Thread-Announcement------------------------------------------
==643==
==643== Thread #1 is the program's root thread
==643==
==643== ----------------------------------------------------------------
==643==
==643== Possible data race during read of size 4 at 0x10057C118 by thread #3
==643== Locks held: none
==643== at 0x10055D1F4: spin_lock (in /usr/lib/system/libsystem_platform.dylib)
==643== by 0x10057092D: _pthread_start (in /usr/lib/system/libsystem_pthread.dylib)
==643== by 0x10056E384: thread_start (in /usr/lib/system/libsystem_pthread.dylib)
==643==
==643== This conflicts with a previous write of size 4 by thread #1
==643== Locks held: none
==643== at 0x10055D200: spin_unlock (in /usr/lib/system/libsystem_platform.dylib)
==643== by 0x1000434B0: __kmp_create_worker (in /usr/local/Cellar/libiomp/20150701/lib/libiomp5.dylib)
==643== by 0x100031E3D: __kmp_allocate_thread (in /usr/local/Cellar/libiomp/20150701/lib/libiomp5.dylib)
==643== by 0x10002E7A1: __kmp_allocate_team (in /usr/local/Cellar/libiomp/20150701/lib/libiomp5.dylib)
==643== by 0x10002FA2D: __kmp_fork_call (in /usr/local/Cellar/libiomp/20150701/lib/libiomp5.dylib)
==643== by 0x100027F0D: __kmpc_fork_call (in /usr/local/Cellar/libiomp/20150701/lib/libiomp5.dylib)
==643== by 0x100000CE8: main (in ./test)
==643== Address 0x10057c118 is in the Data segment of /usr/lib/system/libsystem_pthread.dylib
The code of the "helloworld" is:
#include <stdio.h>
#include <libiomp/omp.h>
#include <stdlib.h>
int main (int argc, char *argv[]) {
int nthreads=4, tid;
#pragma omp parallel num_threads(nthreads) private(tid)
{
//Obtain thread number
tid = omp_get_thread_num();
printf("Hello World from thread = %d\n", tid);
// Only master thread does this
if (tid == 0)
{
printf("Number of threads = %d\n", nthreads);
}
}
return 0;
}
Does anyone know what causes these errors (data races)? I do not share any data between these threads.
This is most likely a false positive. See for example a similar discussion for the same issue with libgomp. It could also be an actual problem in the libiomp / pthread implementation, but that seems rather unlikely.
There seems to be little that you can do about it. In general, if the top of the stack is in a library, it is either a false positive, a bug in the library, or a misuse of the library on your part (e.g. running memcpy on the same buffer from multiple threads).
If your binary is at the top of the stack, it is more clearly an issue with your code.
Your code is fine with respect to data races.
Here is a C++ program that runs 10 times with 5 different threads, and each thread increments the value of counter, so the final output should be 500, which is exactly what the program prints. But I can't understand why it prints 500 every time: the output should vary between runs, because the increment operation is not atomic and no locks are used.
Edit: to increase the probability of a race condition I increased the loop count, but I still couldn't see any varying output.
#include <iostream>
#include <thread>
#include <vector>
struct Counter {
int value;
Counter() : value(0){}
void increment(){
value = value + 1000;
}
};
int main(){
int n = 50000;
while(n--){
Counter counter;
std::vector<std::thread> threads;
for(int i = 0; i < 5; ++i){
threads.push_back(std::thread([&counter](){
for(int i = 0; i < 1000; ++i){
counter.increment();
}
}));
}
for(auto& thread : threads){
thread.join();
}
std::cout << counter.value << std::endl;
}
return 0;
}
You're just lucky :)
Compiling with clang++ my output is not always 500:
500
425
470
500
500
500
500
500
432
440
Note
Using g++ with -fsanitize=thread -static-libtsan:
WARNING: ThreadSanitizer: data race (pid=13871)
Read of size 4 at 0x7ffd1037a9c0 by thread T2:
#0 Counter::increment() <null> (Test+0x000000509c02)
#1 main::{lambda()#1}::operator()() const <null> (Test+0x000000507ed1)
#2 _M_invoke<> /usr/include/c++/5/functional:1531 (Test+0x0000005097d7)
#3 operator() /usr/include/c++/5/functional:1520 (Test+0x0000005096b2)
#4 _M_run /usr/include/c++/5/thread:115 (Test+0x0000005095ea)
#5 <null> <null> (libstdc++.so.6+0x0000000b8c7f)
Previous write of size 4 at 0x7ffd1037a9c0 by thread T1:
#0 Counter::increment() <null> (Test+0x000000509c17)
#1 main::{lambda()#1}::operator()() const <null> (Test+0x000000507ed1)
#2 _M_invoke<> /usr/include/c++/5/functional:1531 (Test+0x0000005097d7)
#3 operator() /usr/include/c++/5/functional:1520 (Test+0x0000005096b2)
#4 _M_run /usr/include/c++/5/thread:115 (Test+0x0000005095ea)
#5 <null> <null> (libstdc++.so.6+0x0000000b8c7f)
shows the race condition. (Also, on my system the output shows results different than 500).
The options for g++ are explained in the documentation for g++ (e.g.: man g++). See also: https://github.com/google/sanitizers/wiki#threadsanitizer.
Just because your code has race conditions does not mean they occur. That is the hard part about them. A lot of times they only occur when something else changes and timing is different.
There are several issues: incrementing to 100 can be done really fast, so your first thread may already be halfway done before the second one is started. The same goes for the next thread, and so on. So you never know whether you really have 5 threads running in parallel.
You should create a barrier at the beginning of each thread to make sure they start all at the same time.
Also maybe try a bit more than "100" and only 5 threads. But it all depends on the system / load / timing. etc.
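The start barrier suggested above could be sketched roughly as follows. This is only an illustration, not code from the question: the name run_gated_increments and the spin-wait gate are assumptions made here. To keep the sketch free of undefined behavior, the counter is a std::atomic<int> whose load and store are deliberately kept as two separate steps, so updates can still be lost without there being a data race in the language sense.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical helper: starts `num_threads` threads, holds them at a gate until
// all of them are ready, then lets each perform `iterations` split
// read-then-write increments on a shared counter. Returns the final value.
int run_gated_increments(int num_threads, int iterations) {
    assert(num_threads > 0 && iterations > 0);
    std::atomic<int> value{0};    // atomic, so there is no data race (no UB)
    std::atomic<int> ready{0};    // how many threads have reached the gate
    std::atomic<bool> go{false};  // the start gate

    std::vector<std::thread> threads;
    for (int t = 0; t < num_threads; ++t) {
        threads.emplace_back([&] {
            ready.fetch_add(1);
            while (!go.load()) std::this_thread::yield();  // wait at the gate
            for (int i = 0; i < iterations; ++i) {
                int v = value.load();  // read ...
                value.store(v + 1);    // ... then write: another thread's update
                                       // in between is overwritten (lost)
            }
        });
    }
    while (ready.load() < num_threads) std::this_thread::yield();
    go.store(true);  // release all threads at the same moment
    for (auto& th : threads) th.join();
    return value.load();
}
```

Calling run_gated_increments(5, 100) returns some value between 1 and 500. How often it falls below 500 still depends on the system, the load, and the timing.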
to increase probability of race condition i increased the loop count
but still couldn't see any varying output
Strictly speaking, you have a data race in this code, which is undefined behavior, and therefore you cannot reliably reproduce it.
But you can rewrite Counter as some "equivalent" code with an artificial delay in increment:
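// Requires <chrono> in addition to <thread>.
struct Counter {
    int value;
    Counter() : value(0){}
    void increment(){
        int val=value;   // read the shared value
        std::this_thread::sleep_for(std::chrono::milliseconds(1));   // widen the race window
        ++val;
        value=val;       // write back, possibly overwriting other threads' updates
    }
};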
I've got the following output with this counter which is far less than 500:
100
100
100
100
100
101
100
100
101
100
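For comparison, here is a minimal sketch of one way to remove the race, using std::atomic<int> so that each increment is a single indivisible read-modify-write. The names AtomicCounter and run_atomic_counter are illustrative only and do not appear in the question.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Illustrative replacement for Counter: the increment is one atomic operation.
struct AtomicCounter {
    std::atomic<int> value{0};
    void increment() { value.fetch_add(1, std::memory_order_relaxed); }
};

// Hypothetical driver: `num_threads` threads each increment `iterations` times.
int run_atomic_counter(int num_threads, int iterations) {
    assert(num_threads > 0 && iterations > 0);
    AtomicCounter counter;
    std::vector<std::thread> threads;
    for (int t = 0; t < num_threads; ++t) {
        threads.emplace_back([&counter, iterations] {
            for (int i = 0; i < iterations; ++i) {
                counter.increment();
            }
        });
    }
    for (auto& th : threads) th.join();  // join makes all increments visible here
    return counter.value.load();
}
```

With 5 threads and 100 iterations each, run_atomic_counter(5, 100) returns 500 on every run, because no update can be lost. A std::mutex around a plain int would work as well, at a somewhat higher cost per increment.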
I ran into a strange phenomenon in OpenMP with shared memory and the print function.
I tested this problem in both C++ and Fortran.
In C++:
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
#include <windows.h>
int main (int argc, char *argv[])
{
int i=1;
#pragma omp parallel sections shared(i)
{
#pragma omp section
{while(true){
i = 1;
printf("thread 1: %i\n", i);
}}
#pragma omp section
{while(true){
i = i - 1000;
printf("thread 2: %i\n", i);
}}
}
}
This code is quite simple and the expected result is something like this:
thread 1: 1
thread 2: -999
thread 1: 1
thread 2: -999
thread 2: -1999
thread 1: 1
However, I could get this result:
thread 1: 1
thread 2: -1726999
thread 2: -1727999
thread 2: -1728999
thread 2: -1729999
thread 2: -1730999
thread 2: -1731999
thread 2: -1732999
It is confusing and looks as if i is not shared! I tried commenting out this line:
printf("thread 1: %i\n", i);
and got:
thread 2: 1
thread 2: -999
thread 2: 1
thread 2: 1
thread 2: -999
thread 2: 1
It looks fine now.
In Fortran:
OpenMP behaves a little differently in Fortran.
PROGRAM test
implicit none
integer*8 i
i = 1
!$OMP parallel sections shared(i)
!$OMP section
do
i = 1
print *, "thread 1" ,i
!call sleep(1)
end do
!$OMP section
do
i = i-1000
print *, "thread 2" ,i
!call sleep(1)
end do
!$OMP end parallel sections
END PROGRAM
This code leads to the same problem as above. But if I comment out thread 1's print, the problem is still there.
I have to call the sleep subroutine, as in the commented-out lines, to get the expected result.
Does anyone know the reason?
Another question: can a variable be modified in one thread at the same time as it is being read in another thread?
You are modifying a shared variable from more than one thread without synchronization. This is known as a data race. The result of your program is unspecified - anything can happen. The same applies if you are writing to a variable in one thread and reading from another without synchronization.
See section 1.4.1 of the OpenMP 4.0 standard for more information.
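As a rough sketch of what synchronized access could look like, each access to i below is made atomic. This assumes OpenMP 3.1 or later for the atomic write and update forms; the function name run_sections and the bounded iteration count are illustrative and not taken from the question.

```cpp
#include <cassert>

// Hypothetical variant of the question's program with bounded loops and
// atomic accesses to the shared variable. Returns the final value of i.
long run_sections(int iterations) {
    assert(iterations >= 0);
    long i = 1;
    #pragma omp parallel sections shared(i)
    {
        #pragma omp section
        {
            for (int k = 0; k < iterations; ++k) {
                #pragma omp atomic write
                i = 1;          // atomic store
            }
        }
        #pragma omp section
        {
            for (int k = 0; k < iterations; ++k) {
                #pragma omp atomic update
                i -= 1000;      // atomic read-modify-write
            }
        }
    }
    // The implicit barrier at the end of the parallel region makes i safe to read.
    return i;
}
```

The interleaving of the two sections is still up to the scheduler, so the final value varies from run to run, but it is always of the form 1 - 1000*k for some k between 0 and iterations. The data race is gone; the ordering is simply not determined.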
I'm kinda new to pthreads and I'm trying to create a program that sorts 1 million randomly generated integers. I seem to have lost a bit of control over the threads. When run the first time, the code only produces a single thread, but when subsequently run, the threads multiply out of control. Since I don't really know precisely where the problem lies, I've provided the code below.
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <iostream>
#define N 8 /* # of thread */
#define NUM_INTS 10000 //ideally should be able to sort 1,000,000
int int_list[NUM_INTS];
/* structure for array index
* used to keep low/high end of sub arrays
*/
typedef struct Arr {
int low;
int high;
} ArrayIndex;
void merge(int low, int high) {
int mid = (low+high)/2;
int left = low;
int right = mid+1;
int list_b[high-low+1];
volatile int i, cur = 0;
while((left <= mid) && (right <= high)) {
if (int_list[left] > int_list[right])
list_b[cur++] = int_list[right++];
else
list_b[cur++] = int_list[right++];
}
while(left <= mid)
list_b[cur++] = int_list[left++];
while(right <= high)
list_b[cur++] = int_list[left++];
for (i = 0; i < (high-low+1) ; i++)
int_list[low+i] = list_b[i];
}
void * mergesort(void *a){
ArrayIndex *pa = (ArrayIndex *)int_list;
int mid = (pa->low + pa->high)/2;
ArrayIndex aIndex[N];
pthread_t thread[N];
aIndex[0].low = pa->low;
aIndex[0].high = mid;
aIndex[1].low = mid+1;
aIndex[1].high = pa->high;
if (pa->low >= pa->high)
return 0;
volatile int i;
for(i = 0; i < N; i++)
pthread_create(&thread[i], NULL, mergesort, &aIndex[i]);
for(i = 0; i < N; i++)
pthread_join(thread[i], NULL);
merge(pa->low, pa->high);
pthread_exit(NULL);
}
int main(){
volatile int i;
struct timeval start_time, end_time;
srand(getpid());
for(i=0; i<NUM_INTS; i++)
int_list[i] = rand();
ArrayIndex ai;
ai.low = 0;
ai.high = NUM_INTS/sizeof(int_list[0])-1;
pthread_t thread;
pthread_create(&thread, NULL, mergesort, &ai);
pthread_join(thread, NULL);
return 0;
}
gdb output:
(gdb) run
Starting program: /.../sort.o
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff6fd5700 (LWP 25801)]
[Thread 0x7ffff6fd5700 (LWP 25801) exited]
Computation Time: 38006 micro-seconds.
[Inferior 1 (process 25797) exited normally]
(gdb) run
Starting program: /.../sort.o
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff6fd5700 (LWP 25804)]
[New Thread 0x7ffff67d4700 (LWP 25805)]
[New Thread 0x7ffff5fd3700 (LWP 25806)]
[New Thread 0x7ffff57d2700 (LWP 25807)]
[New Thread 0x7ffff4fd1700 (LWP 25808)]
[New Thread 0x7fffef7fe700 (LWP 25811)]
[New Thread 0x7fffeeffd700 (LWP 25810)]
...
[New Thread 0x7ffeca6ec700 (LWP 26148)]
Program received signal SIGINT, Interrupt.
[Switching to Thread 0x7ffee8728700 (LWP 26088)]
__GI___nptl_create_event () at events.c:25
25 events.c: No such file or directory.
The problem is that you try to implement recursive divide-and-conquer parallelism by starting a new thread for each sub-problem, up to the point when a thread is given a single array item to "sort". This approach is just plain wrong for multiple reasons. To give you just one, sorting an array of 1 million items would require a million threads at the leaf calls of the recursion, and another million at all recursion levels above. Even if you introduce some grain size (a threshold below which the recursion becomes serial), the total number of threads would still likely be very big, unless your threshold is something like NUM_INTS/N.
Even setting the above aside, your implementation has some bugs:
At each level of recursion, you start N threads, even though the work is split just in halves. aIndex[i] is uninitialized for i>1, so the corresponding threads receive garbage in their input parameter.
You cast int_list, which decays to a pointer to int, to a pointer to ArrayIndex, instead of casting the thread argument a.
There are a few ways you could fix the design:
The simple one is to introduce a proper threshold below which the recursion becomes serial, as I said above.
The more complex one, which is also more generic and flexible, is to implement a pool of threads and a pool/queue of tasks processed by the threads; so when you split the given array in halves, you create two tasks, one for each half, and submit them to the work queue that the threads take work from. Note that for good performance you would still need to set some grain size in order to have a sufficient amount of work per task, but that would be a much smaller threshold than the one necessary to limit the number of threads.
The right one, especially for production code, is to take a library or parallel technology that has proper primitives for recursive parallelism, such as Intel's Threading Building Blocks (TBB) or Microsoft's Parallel Patterns Library (PPL).
See also some links (and generally, google for "parallel merge sort C++")
performance problems in parallel mergesort C++
https://software.intel.com/en-us/articles/a-parallel-stable-sort-using-c11-for-tbb-cilk-plus-and-openmp
http://www.drdobbs.com/parallel/parallel-merge-sort/229400239
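The simple fix (a threshold below which the recursion becomes serial) might look something like the sketch below. It is written with std::thread rather than raw pthreads for brevity, and it limits thread creation by recursion depth rather than by sub-array size; both choices, as well as the name parallel_merge_sort, are assumptions made for this illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <thread>
#include <vector>

using Iter = std::vector<int>::iterator;

// Sorts [first, last). Only the top `depth` levels of the recursion spawn a
// thread; below that, the range is sorted serially with std::sort.
void parallel_merge_sort(Iter first, Iter last, int depth) {
    if (last - first < 2) return;      // 0 or 1 elements: already sorted
    if (depth <= 0) {                  // threshold reached: go serial
        std::sort(first, last);
        return;
    }
    Iter mid = first + (last - first) / 2;
    // Sort the left half in a new thread and the right half in this thread,
    // so each split adds one thread instead of N.
    std::thread left([first, mid, depth] {
        parallel_merge_sort(first, mid, depth - 1);
    });
    parallel_merge_sort(mid, last, depth - 1);
    left.join();                       // both halves are sorted after this
    std::inplace_merge(first, mid, last);
}
```

With depth set to 3, at most 8 ranges are sorted concurrently (7 spawned threads plus the caller), regardless of the array size. For example, parallel_merge_sort(v.begin(), v.end(), 3) sorts a std::vector<int> v in place.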
I'm new to multithreading in Windows, so this might be a trivial question: what's the easiest way of making sure that threads perform a loop in lockstep?
I tried passing a shared array of Events to all threads and using WaitForMultipleObjects at the end of the loop to synchronize them, but this gives me a deadlock after one, sometimes two, cycles. Here's a simplified version of my current code (with just two threads, but I'd like to make it scalable):
typedef struct
{
int rank;
HANDLE* step_events;
} IterationParams;
int main(int argc, char **argv)
{
// ...
IterationParams p[2];
HANDLE step_events[2];
for (int j=0; j<2; ++j)
{
step_events[j] = CreateEvent(NULL, FALSE, FALSE, NULL);
}
for (int j=0; j<2; ++j)
{
p[j].rank = j;
p[j].step_events = step_events;
AfxBeginThread(Iteration, p+j);
}
// ...
}
UINT Iteration(LPVOID pParam)
{
IterationParams* p = (IterationParams*)pParam;
int rank = p->rank;
for (int i=0; i<100; i++)
{
if (rank == 0)
{
printf("%dth iteration\n",i);
// do something
SetEvent(p->step_events[0]);
WaitForMultipleObjects(2, p->step_events, TRUE, INFINITE);
}
else if (rank == 1)
{
// do something else
SetEvent(p->step_events[1]);
WaitForMultipleObjects(2, p->step_events, TRUE, INFINITE);
}
}
return 0;
}
(I know I'm mixing C and C++, it's actually legacy C code that I'm trying to parallelize.)
Reading the docs at MSDN, I think this should work. However, thread 0 only prints once, occasionally twice, and then the program hangs. Is this a correct way of synchronizing threads? If not, what would you recommend (is there really no built-in support for a barrier in MFC?).
EDIT: this solution is WRONG, even including Alessandro's fix. For example, consider this scenario:
Thread 0 sets its event and calls Wait, blocks
Thread 1 sets its event and calls Wait, blocks
Thread 0 returns from Wait, resets its event, and completes a cycle without Thread 1 getting control
Thread 0 sets its own event and calls Wait. Since Thread 1 had no chance to reset its event yet, Thread 0's Wait returns immediately and the threads go out of sync.
So the question remains: how does one safely ensure that the threads stay in lockstep?
Introduction
I implemented a simple C++ program for your consideration (tested in Visual Studio 2010). It is using only Win32 APIs (and standard library for console output and a bit of randomization). You should be able to drop it into a new Win32 console project (without precompiled headers), compile and run.
Solution
#include <stdio.h>   // wprintf
#include <stdlib.h>  // rand, srand, RAND_MAX
#include <tchar.h>
#include <windows.h>
//---------------------------------------------------------
// Defines synchronization info structure. All threads will
// use the same instance of this struct to implement rendezvous/
// barrier synchronization pattern.
struct SyncInfo
{
SyncInfo(int threadsCount) : Awaiting(threadsCount), ThreadsCount(threadsCount), Semaphore(::CreateSemaphore(0, 0, 1024, 0)) {};
~SyncInfo() { ::CloseHandle(this->Semaphore); }
volatile unsigned int Awaiting; // how many threads still have to complete their iteration
const int ThreadsCount;
const HANDLE Semaphore;
};
//---------------------------------------------------------
// Thread-specific parameters. Note that Sync is a reference
// (i.e. all threads share the same SyncInfo instance).
struct ThreadParams
{
ThreadParams(SyncInfo &sync, int ordinal, int delay) : Sync(sync), Ordinal(ordinal), Delay(delay) {};
SyncInfo &Sync;
const int Ordinal;
const int Delay;
};
//---------------------------------------------------------
// Called at the end of each iteration, it will "rendezvous"
// (meet) all the threads before returning (so that next
// iteration can begin). In practical terms this function
// will block until all the other threads finish their iteration.
static void RandezvousOthers(SyncInfo &sync, int ordinal)
{
if (0 == ::InterlockedDecrement(&(sync.Awaiting))) { // are we the last ones to arrive?
// at this point, all the other threads are blocking on the semaphore
// so we can manipulate shared structures without having to worry
// about conflicts
sync.Awaiting = sync.ThreadsCount;
wprintf(L"Thread %d is the last to arrive, releasing synchronization barrier\n", ordinal);
wprintf(L"---~~~---\n");
// let's release the other threads from their slumber
// by using the semaphore
::ReleaseSemaphore(sync.Semaphore, sync.ThreadsCount - 1, 0); // "ThreadsCount - 1" because this last thread will not block on semaphore
}
else { // nope, there are other threads still working on the iteration so let's wait
wprintf(L"Thread %d is waiting on synchronization barrier\n", ordinal);
::WaitForSingleObject(sync.Semaphore, INFINITE); // note that return value should be validated at this point ;)
}
}
//---------------------------------------------------------
// Define worker thread lifetime. It starts with retrieving
// thread-specific parameters, then loops through 5 iterations
// (rendezvous-ing with other threads at the end of each),
// and then finishes (the thread can then be joined).
static DWORD WINAPI ThreadProc(void *p)
{
ThreadParams *params = static_cast<ThreadParams *>(p);
wprintf(L"Starting thread %d\n", params->Ordinal);
for (int i = 1; i <= 5; ++i) {
wprintf(L"Thread %d is executing iteration #%d (%d delay)\n", params->Ordinal, i, params->Delay);
::Sleep(params->Delay);
wprintf(L"Thread %d is synchronizing end of iteration #%d\n", params->Ordinal, i);
RandezvousOthers(params->Sync, params->Ordinal);
}
wprintf(L"Finishing thread %d\n", params->Ordinal);
return 0;
}
//---------------------------------------------------------
// Program to illustrate iteration-lockstep C++ solution.
int _tmain(int argc, _TCHAR* argv[])
{
// prepare to run
::srand(::GetTickCount()); // pseudo-randomize random values :-)
SyncInfo sync(4);
ThreadParams p[] = {
ThreadParams(sync, 1, ::rand() * 900 / RAND_MAX + 100), // a delay between 100 and 1000 milliseconds will simulate work that an iteration would do
ThreadParams(sync, 2, ::rand() * 900 / RAND_MAX + 100),
ThreadParams(sync, 3, ::rand() * 900 / RAND_MAX + 100),
ThreadParams(sync, 4, ::rand() * 900 / RAND_MAX + 100),
};
// let the threads rip
HANDLE t[] = {
::CreateThread(0, 0, ThreadProc, p + 0, 0, 0),
::CreateThread(0, 0, ThreadProc, p + 1, 0, 0),
::CreateThread(0, 0, ThreadProc, p + 2, 0, 0),
::CreateThread(0, 0, ThreadProc, p + 3, 0, 0),
};
// wait for the threads to finish (join)
::WaitForMultipleObjects(4, t, true, INFINITE);
return 0;
}
Sample Output
Running this program on my machine (dual-core) yields the following output:
Starting thread 1
Starting thread 2
Starting thread 4
Thread 1 is executing iteration #1 (712 delay)
Starting thread 3
Thread 2 is executing iteration #1 (798 delay)
Thread 4 is executing iteration #1 (477 delay)
Thread 3 is executing iteration #1 (104 delay)
Thread 3 is synchronizing end of iteration #1
Thread 3 is waiting on synchronization barrier
Thread 4 is synchronizing end of iteration #1
Thread 4 is waiting on synchronization barrier
Thread 1 is synchronizing end of iteration #1
Thread 1 is waiting on synchronization barrier
Thread 2 is synchronizing end of iteration #1
Thread 2 is the last to arrive, releasing synchronization barrier
---~~~---
Thread 2 is executing iteration #2 (798 delay)
Thread 3 is executing iteration #2 (104 delay)
Thread 1 is executing iteration #2 (712 delay)
Thread 4 is executing iteration #2 (477 delay)
Thread 3 is synchronizing end of iteration #2
Thread 3 is waiting on synchronization barrier
Thread 4 is synchronizing end of iteration #2
Thread 4 is waiting on synchronization barrier
Thread 1 is synchronizing end of iteration #2
Thread 1 is waiting on synchronization barrier
Thread 2 is synchronizing end of iteration #2
Thread 2 is the last to arrive, releasing synchronization barrier
---~~~---
Thread 4 is executing iteration #3 (477 delay)
Thread 3 is executing iteration #3 (104 delay)
Thread 1 is executing iteration #3 (712 delay)
Thread 2 is executing iteration #3 (798 delay)
Thread 3 is synchronizing end of iteration #3
Thread 3 is waiting on synchronization barrier
Thread 4 is synchronizing end of iteration #3
Thread 4 is waiting on synchronization barrier
Thread 1 is synchronizing end of iteration #3
Thread 1 is waiting on synchronization barrier
Thread 2 is synchronizing end of iteration #3
Thread 2 is the last to arrive, releasing synchronization barrier
---~~~---
Thread 2 is executing iteration #4 (798 delay)
Thread 3 is executing iteration #4 (104 delay)
Thread 1 is executing iteration #4 (712 delay)
Thread 4 is executing iteration #4 (477 delay)
Thread 3 is synchronizing end of iteration #4
Thread 3 is waiting on synchronization barrier
Thread 4 is synchronizing end of iteration #4
Thread 4 is waiting on synchronization barrier
Thread 1 is synchronizing end of iteration #4
Thread 1 is waiting on synchronization barrier
Thread 2 is synchronizing end of iteration #4
Thread 2 is the last to arrive, releasing synchronization barrier
---~~~---
Thread 3 is executing iteration #5 (104 delay)
Thread 4 is executing iteration #5 (477 delay)
Thread 1 is executing iteration #5 (712 delay)
Thread 2 is executing iteration #5 (798 delay)
Thread 3 is synchronizing end of iteration #5
Thread 3 is waiting on synchronization barrier
Thread 4 is synchronizing end of iteration #5
Thread 4 is waiting on synchronization barrier
Thread 1 is synchronizing end of iteration #5
Thread 1 is waiting on synchronization barrier
Thread 2 is synchronizing end of iteration #5
Thread 2 is the last to arrive, releasing synchronization barrier
---~~~---
Finishing thread 4
Finishing thread 3
Finishing thread 2
Finishing thread 1
Note that for simplicity each thread has random duration of iteration, but all iterations of that thread will use that same random duration (i.e. it doesn't change between iterations).
How does it work?
The "core" of the solution is in the "RandezvousOthers" function. If the calling thread is not the last one to arrive, the function blocks on the shared semaphore. If the calling thread is the last one to arrive, the function resets the shared SyncInfo structure and releases all the threads that are blocked on the semaphore.
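The same "last one to arrive releases everybody" idea can also be sketched in portable standard C++. The version below is not the Win32 code above: it swaps the semaphore for a std::mutex and a std::condition_variable, and adds a generation counter so that a thread racing ahead into the next iteration cannot be confused with one still waiting on the previous one. The names Barrier, arrive_and_wait and run_lockstep are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Reusable barrier for a fixed number of threads.
class Barrier {
public:
    explicit Barrier(int count) : threshold_(count), waiting_(count), generation_(0) {}

    void arrive_and_wait() {
        std::unique_lock<std::mutex> lock(mutex_);
        const unsigned long gen = generation_;
        if (--waiting_ == 0) {        // last thread to arrive
            ++generation_;            // open the barrier for this round
            waiting_ = threshold_;    // reset for the next round
            cv_.notify_all();
        } else {                      // block until the round completes
            cv_.wait(lock, [&] { return gen != generation_; });
        }
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    const int threshold_;
    int waiting_;
    unsigned long generation_;
};

// Hypothetical check: returns true if, after every barrier, all threads had
// finished the iteration that the barrier closes.
bool run_lockstep(int num_threads, int iterations) {
    Barrier barrier(num_threads);
    std::mutex m;
    std::vector<int> finished(iterations, 0);
    std::atomic<bool> ok{true};
    std::vector<std::thread> threads;
    for (int t = 0; t < num_threads; ++t) {
        threads.emplace_back([&] {
            for (int k = 0; k < iterations; ++k) {
                { std::lock_guard<std::mutex> g(m); ++finished[k]; }  // "work"
                barrier.arrive_and_wait();
                std::lock_guard<std::mutex> g(m);
                if (finished[k] != num_threads) ok = false;
            }
        });
    }
    for (auto& th : threads) th.join();
    return ok;
}
```

For instance, run_lockstep(4, 5) mirrors the four threads and five iterations of the program above and returns true when the threads stayed in lockstep.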
To make it work, set the second parameter of CreateEvent to TRUE. This makes the events "manual reset" and prevents the WaitXxx call from resetting them.
Then place a ResetEvent call at the beginning of the loop.
I found this SyncTools (download SyncTools.zip) by googling "barrier synchronization windows". It uses one CriticalSection and one Event to implement a barrier for N threads.